MAXIMUS  INC  MCLEAN  VA  F/G  12/1 

FURTHER  RESEARCH  INTO  A  NON-PARAMETRIC  STATISTICAL  SCREENING  SY--ETCCU) 
DEC  79 



MAXIMUS 




Prepared  for: 

Office  of  Naval  Research 
Department  of  the  Navy 


DTIC ELECTE
FEB 27 1980


FURTHER RESEARCH INTO A
NON-PARAMETRIC STATISTICAL
SCREENING SYSTEM


December  14,  1979 


Funds  for  this  project  were  supported  by  the 
Statistics  and  Probability  Program,  Office 
of  Naval  Research  under  Contract  NR  042-401. 


This document has been approved for public release and sale;
its distribution is unlimited.


MAXIMUS,  Inc. 

6723  Whittier  Avenue 
McLean,  Virginia  22101 
(703)734-0050 




ABSTRACT 


A new statistical technique is introduced for screening a population on the basis of the observed characteristics of its members. The technique treats nominal independent variables with a binary dependent variable. Different objective functions are specified for constructing different decision rules. A recommended decision rule results from achieving a proportionate reduction in error (PRE) by using information in the independent variables rather than just the dependent variable. Current approaches to screening are compared using five desirable properties postulated for decision rules. A Monte Carlo simulation approach is used to construct decision rules using Boolean operators. Finally, a General Sequential Algorithm is presented.




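TABLE OF CONTENTS

                                                          Page

I.    INTRODUCTION
      1.1  Statistical Screening ........................ I-1
      1.2  Considerations in Screening .................. I-2
      1.3  Research Questions ........................... I-3
      1.4  Overview of the Report ....................... I-4

II.   PROBLEM FRAMEWORK ................................. II-1
      2.1  Introduction ................................. II-1
      2.2  Notation and Assumptions ..................... II-1
      2.3  Objective Functions .......................... II-3
      2.4  Screening and Prediction Logic ............... II-10
      2.5  Statistical Inference ........................ II-20
      2.6  Statistical Inference for Ex Post Analysis ... II-24
      2.7  Summary ...................................... II-27

III.  REVIEW OF OTHER TECHNIQUES ........................ III-1
      3.1  Linear Discriminant Function ................. III-1
      3.2  Multiple Regression Analysis ................. III-8
      3.3  Logit/Probit Analysis ........................ III-10
      3.4  The Multinomial Model ........................ III-13
      3.5  Automatic Interaction Detection (AID) ........ III-16
      3.6  Summary ...................................... III-19

IV.   THE MONTE CARLO APPROACH .......................... IV-1
      4.1  Background ................................... IV-1
      4.2  Trial and Error Profile Selection ............ IV-1
      4.3  Monte Carlo Approach ......................... IV-4
      4.4  Evaluation ................................... IV-10
      4.5  Summary ...................................... IV-14

V.    NEW SCREENING PROCEDURES .......................... V-1
      5.1  Introduction ................................. V-1
      5.2  The General Sequential Algorithm ............. V-1
      5.3  Sequential Algorithm for $\nabla$ ............ V-5
      5.4  Minimize Probability of Misclassification .... V-7
      5.5  Maximize P.Q ................................. V-10
      5.6  Constrained Objective Functions .............. V-11
      5.7  Pre-Specified Form of Output ................. V-15
      5.8  Evaluation ................................... V-21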


I. INTRODUCTION


1.1 Statistical Screening


The  use  of  statistical  techniques  to  distinguish  members 
from  two  or  more  population  groups  got  its  greatest  impetus  from 
the  development  of  the  linear  discriminant  function  by  Fisher  in 
1935.  Although  many  advancements  in  the  state-of-the-art  have 
been  made  since  that  time,  the  LDF  remains  the  most  prevalent 
technique. 

Among  the  many  applications  of  statistical  screening  to 
practical  problems,  the  following  are  typical: 

•  screening  tax  returns  for  cases  with 
underestimated  tax  liability; 

•  identifying  college  football  players  with 
the  greatest  potential  for  success  in  the 
NFL; 

•  identifying  potential  fraud  in  public 
assistance  programs; 

•  screening potential borrowers for creditworthiness;

•  screening suspected criminals for prosecution;

•  identifying disease-prone individuals.

In  general,  the  aim  is  to  identify  two  groups  so  that  members  of 
each  group  may  be  treated  differently. 

Similar  applications  may  occur  for  the  Navy.  In  particular, 
consider  the  following: 

•  aviation officer program attrition has been a major problem. Costs associated with in-flight training attrition have been estimated at over $40 million per year. A 10% decrease in attrition in each phase of flight training could result in a savings of over $2 million per year;

•  the attrition rate among first-term enlistees is 10% during recruit training and another 7% in the remainder of the first year. At an estimated cost in excess of $5,000 per first-year failure, a 25% reduction in attrition would result in savings in excess of $20 million.

Thus,  the  potential  for  improving  the  screening  process  for 
recruits  or  trainees  is  very  high  if  an  effective  technique  can 
be  developed.  In  the  next  section,  we  review  briefly  some  of 
the  important  considerations  in  screening. 

1.2 Considerations in Screening


In this section, we introduce three of the major considerations relevant to a screening problem:

•  the  descriptive  variables; 

•  the  interaction  among  variables; 

•  the  objective  function. 


1.2.1  Variables 


In most applications, the researcher has a set of variables $X_1, \ldots, X_p$ which can be used to describe population members. The intent is to use these variables to distinguish members of each group. The variables themselves may be categorized on the basis of their scale of measurement:

•  a  nominal  scale  is  used  to  distinguish 
between  different  classes.  The  particular 
values  taken  on  by  a  nominal  scale  variable 
have  no  particular  significance.  They  may 
be  renamed,  relabeled  or  reordered  without 
changing  the  meaning; 

•  an  ordinal  scale  is  used  if  there  is  an 
underlying  ordering  of  classes; 

•  an  interval  scale  is  used  if  the  difference 
between  two  values  has  meaning; 

•  a  ratio  scale  is  an  interval  scale  with  a 
true  zero  point. 

The scale definitions are such that each scale incorporates the properties of the preceding scale. Thus, a variable measured on a ratio scale provides more information than a variable on a nominal scale, for example. On the other hand, mathematical operations that may be meaningful for an interval or ratio scale might not be applicable to variables measured on a nominal or ordinal scale.

Nonetheless, statistical screening techniques are often applied without attention to the types of variables describing population members. In particular, users apply sophisticated mathematical procedures to variables measured at the nominal level. Some of the problems encountered with such techniques are described in Chapter III. One of the major objectives of this research, then, is to develop a class of statistical screening techniques which are designed to be compatible with, and meaningful for, nominal level variables.

1.2.2  Interaction  Among  Variables 


In many practical applications, especially with human populations, it may be the combination of variable values describing an individual that is a distinguishing factor. Some statistical techniques focus on the significance of single variables, considered one at a time, and miss important relationships that exist (see Chapter II). With the increased power of computers, the ability to detect interaction effects has been greatly enhanced. The approaches we develop here take advantage of this power.

1.2.3  Objective  Functions 

Much of the statistical research pertaining to statistical screening has focused on the probability of misclassification as the appropriate function to be minimized. In practice, however, the ultimate user may have a different objective function related, perhaps, to cost, resources and other constraints. In Chapter II, we discuss alternative objective functions of interest and in Chapter III we state that a desirable property of a statistical screening procedure is that it be flexible with respect to the types of objective functions that can be handled. Again, a major objective of this research is to develop a class of statistical screening procedures having this property.

1.3  Research  Questions 


With  the  above  considerations  in  mind,  then,  the  following 
research  questions  may  be  identified: 

•  Can the screening problem be characterized in a uniform fashion? What objective functions are pertinent to the screening problem? How are they related?

•  How can alternative techniques be compared on a statistical basis? Can confidence intervals and hypothesis testing procedures be developed for parameters of interest?

•  What properties should a screening technique have? How do some of the well-known screening techniques fare with respect to these properties?

•  What new procedures can be developed to handle the type of problem under consideration? How well have these procedures performed in actual practice? What are the features of these procedures?


The  ultimate  objective  of  the  research  is  to  develop  a  new 
class  of  statistical  screening  procedures  that  have  general 
applicability  to  a  wide  range  of  practical  problems. 


1.4  Overview  of  the  Report 


While much of statistical research must, by necessity, deal with theoretical considerations with, perhaps, restrictive or unrealistic assumptions, there is also a strong need for applications-oriented research that may open up the field of statistics to a wider range of actual problems. The research in this report, while based on sound statistical underpinnings, is definitely oriented towards practical applications. This is both a strength and a weakness. It is a strength in that there are many real-life situations in which the results could immediately be put to use; it is a weakness in that the results are based, to a large extent, on intuition and experience rather than on rigorous mathematical development. The nature of the problem makes it rather intractable for closed-form analysis.

Nonetheless, the research presented here represents a new step in the field of statistical screening, a step that we believe contributes greatly to the current state-of-the-art. The closest parallel to the type of research presented here is the work of Sonquist and Morgan in developing the Automatic Interaction Detection (AID) technique. The rapid popularity AID has gained as a technique in the few years since it was developed is a testament to the need for applications-oriented research in this field.


The  report  is  organized  as  follows: 

•  In Chapter II, we define the problem framework and underlying assumptions. We show how the screening problem can be characterized by three parameters. We present some of the potentially relevant objective functions and use some results from the field of prediction logic.

•  In Chapter III, we review the major competing techniques in terms of five properties that we believe should be held by an effective screening technique.

•  In Chapter IV, we present an initial approach taken to development of a technique satisfying the above properties. Results of an empirical test of the approach are presented.

•  In Chapter V, we develop a class of statistical techniques, based upon a General Sequential Algorithm.




II.  PROBLEM  FRAMEWORK 




2.1  Introduction 


In this chapter, we present the basic framework for characterizing the screening problem. First, we introduce the assumptions and notation for the type of screening situation under consideration. Then, we demonstrate that the problem can be specified in terms of three basic parameters. This leads to a discussion of possible objective functions for the decision problem. Then, we relate the screening problem to the field of prediction logic and, using this relationship, develop some results that are useful for defining and evaluating procedures.

2.2 Notation and Assumptions

Assume that we have a random sample of size $n$ from a mixed population $\Pi = \Pi_1 \cup \Pi_2$, where $\Pi_1 \cap \Pi_2 = \emptyset$. Each observation is a vector $\vec{x}$ from the sample space $X = (X_1, \ldots, X_p)$, where each $X_i$, $i = 1, \ldots, p$, can take on any of $s_i$ discrete values. We also assume that the population membership of each sample observation is known.

The aim of a statistical screening procedure is to develop decision rules $D = \langle D_1, D_2 \rangle$, $D \in \mathcal{D}$, of the type:

if $\vec{x} \in D_1$, we assign $\vec{x}$ to $\Pi_1$;

if $\vec{x} \in D_2$, we assign $\vec{x}$ to $\Pi_2$.

That is, $D_1$ and $D_2$ partition the sample space $X$. We are restricting the analysis to the case where $D_1 \cup D_2 = X$ and $D_1 \cap D_2 = \emptyset$. That is, every observation is assigned to $\Pi_1$ or $\Pi_2$. Some researchers allow the possibility of a set $D_3 = (D_1 \cup D_2)^c$ in which no decision is made or new observations are assigned to $\Pi_1$ or $\Pi_2$ according to some random process.

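Thus, given a decision rule $D = \langle D_1, D_2 \rangle$, the population $\Pi$ may be partitioned into a 2 x 2 table as shown below:

$$
\begin{array}{c|cc|c}
      & D_1    & D_2    &            \\ \hline
\Pi_1 & P_{11} & P_{12} & P_{1\cdot} \\
\Pi_2 & P_{21} & P_{22} & P_{2\cdot} \\ \hline
      & P_{\cdot 1} & P_{\cdot 2} & 1
\end{array}
$$

where $P_{ij} = P(\Pi_i, D_j)$, $i, j = 1, 2$. For the sample, the 2 x 2 table contains the sample frequencies $p_{ij}$.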

To simplify the notation, consider a reduced set of parameters. Specifically, let

$$P(\vec{x} \in D_1) = P(D_1) = P$$
$$P(\vec{x} \in \Pi_1 \mid \vec{x} \in D_1) = P(\Pi_1 \mid D_1) = Q$$
$$P(\vec{x} \in \Pi_1) = P(\Pi_1) = E = 1 - P(\Pi_2)$$

Then, the table entries are as follows:

$$P_{11} = P(D_1, \Pi_1) = PQ$$
$$P_{21} = P(D_1) - P(D_1, \Pi_1) = P - PQ = P(1-Q)$$
$$P_{12} = P(D_2, \Pi_1) = P(\Pi_1) - P(D_1, \Pi_1) = E - PQ$$
$$P_{22} = P(D_2) - P(D_2, \Pi_1) = 1 - P - (E - PQ).$$

The entries are shown below:

$$
\begin{array}{c|cc|c}
      & D_1    & D_2        &     \\ \hline
\Pi_1 & PQ     & E - PQ     & E   \\
\Pi_2 & P(1-Q) & 1-P-(E-PQ) & 1-E \\ \hline
      & P      & 1-P        & 1
\end{array}
$$

Since $E$ is a constant, decision rules $D$ are characterized by the pair $(P, Q)$.

The corresponding statistics for any sample are given by $p$, $q$ and $e$. Note that $e$ is a constant for any given sample being used to construct decision rules $D$.


It is possible to construct a boundary that contains all decision rules by considering the following constraints:

1.  $0 \le P, Q \le 1$ (since $P$ and $Q$ are probabilities).

2.  $PQ \le E$ (since $E = P(D_1, \Pi_1) + P(D_2, \Pi_1) \ge P(D_1, \Pi_1) = PQ$).

3.  $P(1-Q) \le 1-E$ (since $1-E = P(D_1, \Pi_2) + P(D_2, \Pi_2) \ge P(D_1, \Pi_2) = P(1-Q)$).


Exhibit 2.1 illustrates the form of this boundary.

Note that when $PQ = E$, $\Pi_1 \subset D_1$. That is, $D_1$ contains all of $\Pi_1$. When $P(1-Q) = 1-E$, $D_1$ contains all of $\Pi_2$.
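The parameterization and the three constraints can be checked numerically. The following is a minimal Python sketch, not part of the original report; the function names `table_from_params` and `is_feasible` are illustrative assumptions. It builds the 2 x 2 table from $(P, Q, E)$ and tests whether a pair $(P, Q)$ lies inside the boundary.

```python
def table_from_params(P, Q, E):
    """Return [[P11, P12], [P21, P22]]; rows are Pi_1, Pi_2 and columns are D_1, D_2."""
    p11 = P * Q                  # in Pi_1 and assigned to D_1
    p12 = E - P * Q              # in Pi_1 but assigned to D_2
    p21 = P * (1 - Q)            # in Pi_2 but assigned to D_1
    p22 = 1 - P - (E - P * Q)    # in Pi_2 and assigned to D_2
    return [[p11, p12], [p21, p22]]


def is_feasible(P, Q, E, tol=1e-12):
    """Check constraints 1-3: 0 <= P, Q <= 1; PQ <= E; P(1-Q) <= 1-E."""
    return (
        0 <= P <= 1
        and 0 <= Q <= 1
        and P * Q <= E + tol
        and P * (1 - Q) <= 1 - E + tol
    )
```

For example, with $E = 0.2$ the pair $(P, Q) = (0.3, 0.5)$ is feasible, while $(0.5, 0.9)$ is not, because $PQ = 0.45$ would exceed $E$.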

The key question that arises from this description, however, is the following. Given two decision rules $D_a$ and $D_b$, as depicted in the exhibit, which one is better? That is, before we can develop procedures for developing decision rules, we must have some concept of what a "good" decision rule is. In the material that follows, we develop some theory of objective functions to answer this question.

2.3 Objective Functions

2.3.1 Minimize Probability of Misclassification

The most prevalent objective function used in screening problems is the probability of misclassification. For a decision rule $D = \langle D_1, D_2 \rangle$, the true probability of misclassification is

$$t(D) = P(D_1, \Pi_2) + P(D_2, \Pi_1),$$

or

$$t(D) = \sum_{D_1} L_2(\vec{x})\,(1-E) + \sum_{D_2} L_1(\vec{x})\,E$$

where $L_i(\vec{x})$ is the density of $\vec{x}$ under population $\Pi_i$, $i = 1, 2$.

In terms of the parameters defined in 2.2, we have

$$t(D) = P(1-Q) + E - PQ = E + P(1-2Q). \qquad (1)$$

Since the decision rule is based on a sample from $\Pi$, the estimated probability of misclassification is

$$\hat{t}(D) = e + p(1-2q). \qquad (2)$$
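As an illustration of equation (2), the sketch below computes the sample statistics $p$, $q$ and $e$ from per-case indicators and evaluates the estimated misclassification probability. It is a hedged example, not the report's procedure: the function names and the boolean-list input format are assumptions made for illustration.

```python
def sample_stats(in_d1, in_pi1):
    """in_d1[k] is True if case k falls in D_1; in_pi1[k] is True if case k is from Pi_1."""
    n = len(in_d1)
    n_d1 = sum(in_d1)
    n_both = sum(1 for d, y in zip(in_d1, in_pi1) if d and y)
    p = n_d1 / n                          # share of cases assigned to D_1
    q = n_both / n_d1 if n_d1 else 0.0    # q is undefined for empty D_1; 0.0 is a placeholder
    e = sum(in_pi1) / n                   # share of cases from Pi_1
    return p, q, e


def estimated_misclassification(p, q, e):
    """Equation (2): e + p(1 - 2q)."""
    return e + p * (1 - 2 * q)
```

With five cases of which two are assigned to $D_1$ (one correctly) and two belong to $\Pi_1$, the statistics are $p = 0.4$, $q = 0.5$, $e = 0.4$, and the formula gives $0.4$, matching the direct count of two misclassified cases out of five.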

Therefore, a reasonable objective for the screening process is to find a decision rule $D^* = \langle D_1^*, D_2^* \rangle$ to minimize $t(D)$. That is,

$$t(D^*) = \inf_{D \in \mathcal{D}} t(D) = \text{optimum probability of misclassification}$$

where $\mathcal{D}$ is the class of all possible decision rules.

In practice, we are dealing with a sample and the corresponding objective is to minimize $\hat{t}(D)$, that is, to find $D^{**}$ such that

$$\hat{t}(D^{**}) = \inf_{D \in \mathcal{D}} \hat{t}(D) = \text{estimated optimum probability of misclassification}.$$

Several authors [Cochran and Hopkins (1961); Mills (1966); Mickey (1968); Glick (1972, 73) and Goldstein and Wolf (1977)] have studied the relationships among $t(D^{**})$, $t(D^*)$ and $\hat{t}(D^{**})$. They show that

$$E(\hat{t}(D^{**})) \le t(D^*). \qquad (3)$$

That is, the estimated optimum misclassification probability tends to be an underestimate of the true optimum probability. This is intuitively reasonable since the screening procedure finds the best rule for the sample rather than the population. This is similar to the result in fitting regression models based on a sample: the sample $R^2$ tends to overestimate the true $R^2$.

Similarly, it holds that

$$t(D^*) \le t(D^{**}), \qquad (4)$$

with equality only if $D^* \equiv D^{**}$. This is true by definition since $t(D^*) = \inf_{D \in \mathcal{D}} t(D) \le t(D)$ $\forall\, D \in \mathcal{D}$.

Lachenbruch (1975) proposes a method to estimate $E(\hat{t}(D^{**}))$, called the mean apparent (misclassification) error. He uses $n-1$ points to classify the remaining point. This procedure is repeated for all points and he records the proportion misclassified for each group. His estimate is

$$E\,\frac{m_1}{n_1} + (1-E)\,\frac{m_2}{n_2} \qquad (5)$$

where $n_1, n_2$ are the number of sample cases from $\Pi_1$ and $\Pi_2$ respectively, and $m_1, m_2$ are the number misclassified from $\Pi_1$ and $\Pi_2$.

Note that Lachenbruch's estimate assumes knowledge of the true probabilities of membership in $\Pi_1$ and $\Pi_2$.

In fact, a jackknife procedure (Miller, 1974) may be used as follows: let $\hat{t}_j$ denote the probability of misclassification with the $j$th observation removed. Then, the jackknifed apparent classification error is

$$t_J(D^{**}) = n\,\hat{t}(D^{**}) - \frac{n-1}{n} \sum_{j=1}^{n} \hat{t}_j(D^{**}). \qquad (6)$$

Example of Bias

Consider the following trivial sample-based decision rule $D = \langle D_1, D_2 \rangle$ with

$$D_1 = \{\vec{x} : \vec{x} \in \vec{x}_s \text{ and } \vec{x} \in \Pi_1\}$$
$$D_2 = D_1^c$$

where $\vec{x}_s$ denotes the sample observations.

That is, a new observation is assigned to $\Pi_1$ if it matches exactly one of the sample observations from $\Pi_1$. Otherwise, it is assigned to $\Pi_2$. Assume further that there are no exact matches in the sample of observations that are in both $\Pi_1$ and $\Pi_2$, i.e., if $\vec{x}_i \in \Pi_1$ and $\vec{x}_j \in \Pi_2$, $\vec{x}_i \ne \vec{x}_j$. This assumption is quite reasonable for small samples with large k.

By construction,

$$\hat{t}(D) = 0 = \inf_{D \in \mathcal{D}} \hat{t}(D) = \hat{t}(D^{**}) \le t(D^{**}).$$

That is, the sample probability of misclassification is zero. In many practical problems, this decision rule is available but is, of course, neither realistic nor useful.
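The bias example can be made concrete with a small sketch. The code below is an illustration under assumed conventions (profiles are tuples of category codes, labels are 1 for $\Pi_1$ and 2 for $\Pi_2$, and the function names are hypothetical); it is not taken from the report.

```python
def memorizing_rule(sample):
    """Build the trivial rule: assign a profile to Pi_1 only if it exactly
    matches a sample profile that came from Pi_1; otherwise assign it to Pi_2."""
    d1 = {profile for profile, label in sample if label == 1}
    return lambda profile: 1 if profile in d1 else 2


def apparent_error(classify, cases):
    """Proportion of (profile, label) cases that the rule misclassifies."""
    wrong = sum(1 for profile, label in cases if classify(profile) != label)
    return wrong / len(cases)


# Hypothetical sample with no profile shared between the two populations.
sample = [((0, 1, 2), 1), ((1, 1, 0), 2), ((2, 0, 0), 1), ((0, 0, 0), 2)]
rule = memorizing_rule(sample)
```

On the construction sample the apparent error of `rule` is zero, yet any new $\Pi_1$ case with an unseen profile is misclassified, which is the sense in which the sample-based estimate is optimistic.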

2.3.2  Minimize  Variance  of  Estimates 

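We can define indicator variables as follows:

$$
Y_{\vec{x}} = \begin{cases} 1 & \text{if } \vec{x} \in \Pi_1 \\ 0 & \text{if } \vec{x} \in \Pi_2 \end{cases}
\qquad
\hat{Y}_{\vec{x}} = \begin{cases} 1 & \text{if } \vec{x} \in D_1 \\ 0 & \text{if } \vec{x} \in D_2 \end{cases}
$$

where, as before, $D = \langle D_1, D_2 \rangle$ is any decision rule.

The estimated variance of the estimates, $\hat{\sigma}^2$, is

$$
\sum_{\vec{x}} \frac{(Y_{\vec{x}} - \hat{Y}_{\vec{x}})^2}{n}
= \sum_{\vec{x} \in \Pi_1} \frac{(Y_{\vec{x}} - \hat{Y}_{\vec{x}})^2}{n}
+ \sum_{\vec{x} \in \Pi_2} \frac{(Y_{\vec{x}} - \hat{Y}_{\vec{x}})^2}{n}
$$

$$
= \sum_{\vec{x} \in \Pi_1} \frac{|Y_{\vec{x}} - \hat{Y}_{\vec{x}}|}{n}
+ \sum_{\vec{x} \in \Pi_2} \frac{|Y_{\vec{x}} - \hat{Y}_{\vec{x}}|}{n}
$$

$$= \hat{P}(D_2 \cap \Pi_1) + \hat{P}(D_1 \cap \Pi_2)$$

$$= \text{Estimated probability of misclassification} = \hat{t}(D).$$

Similarly, the true variance is $\sigma^2 = t(D)$.

Thus, we have shown that minimizing the probability of misclassification is equivalent to minimizing the variance of estimates.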

2.3.3 Maximize $R^2$

The results of 2.3.2 can be extended to include a measure equivalent to the $R^2$ measure used in curve fitting:

$$R^2 = \frac{\text{Total Variance} - \text{Unexplained Variance}}{\text{Total Variance}} = \frac{\text{Explained Variance}}{\text{Total Variance}}$$

$$\text{Total Variance} = \sum_{\vec{x}} \frac{(Y_{\vec{x}} - \bar{Y})^2}{n} = \frac{ne - ne^2}{n} = e(1-e)$$

$$\text{Unexplained Variance} = \sum_{\vec{x}} \frac{(Y_{\vec{x}} - \hat{Y}_{\vec{x}})^2}{n} = e + p - 2pq$$

Thus,

$$R^2 = \frac{e(1-e) - (e + p - 2pq)}{e(1-e)} = 1 - \frac{e + p - 2pq}{e(1-e)} = 1 - \frac{\text{Probability of Misclassification}}{\text{Total Variance}}$$

Since $e$ is fixed for any sample, maximizing $R^2$ is equivalent to minimizing the probability of misclassification. Note also that

$$R^2 = 1 \iff e + p - 2pq = 0 \iff p = e \text{ and } q = 1$$
$$R^2 = 0 \iff e + p - 2pq = e(1-e) \iff p(1-2q) = -e^2.$$

2.3.4  Maximize  Yield 


In some practical applications, the aim may be to detect as many cases from $\Pi_1$ as possible. That is, the intent is to maximize $PQ$ or, for the sample, to maximize $pq$. Note that $pq/e$ is the sample proportion of $\Pi_1$ cases correctly classified into $\Pi_1$.
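The sample objective functions of 2.3.3 and 2.3.4 are simple functions of $(p, q, e)$. The following Python sketch, with illustrative function names that are assumptions rather than the report's notation, evaluates the $R^2$ measure and the proportion $pq/e$ of $\Pi_1$ cases captured.

```python
def r_squared(p, q, e):
    """R^2 = 1 - (e + p - 2pq) / (e(1 - e)); requires 0 < e < 1."""
    return 1 - (e + p - 2 * p * q) / (e * (1 - e))


def captured_proportion(p, q, e):
    """Sample proportion of Pi_1 cases classified into Pi_1, pq / e."""
    return p * q / e
```

The perfect rule ($p = e$, $q = 1$) gives $R^2 = 1$ and captures every $\Pi_1$ case. A rule with $e = 0.5$, $p = 0.5$, $q = 0.75$ satisfies $p(1-2q) = -e^2$ and therefore has $R^2 = 0$, even though it captures 75% of $\Pi_1$.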

2.3.5  Maximize  Q  Subject  to  Fixed  P 

As discussed in Chapter I, the aim of the screening effort is to identify new cases that are likely to be from $\Pi_1$ so that they may be treated differently, e.g., auditing tax returns. However, resources may be limited; for example, the IRS may only be able to audit 10% of each year's returns.

Thus, another objective function may be to maximize $Q$ subject to fixed $P = P^*$, say. Thus, the space of decision rules is reduced to those with a value of $P \le P^*$.

When dealing with a sample, however, we only have the sample realizations $p$ for each decision rule. For this situation, the following approach is suggested. Compute

$$p^* = P^* + z_\alpha \sqrt{P^*(1-P^*)/n},$$

where $z_\alpha$ is the normal ordinate associated with a one-tailed probability of $\alpha$ and $n$ is the sample size.

Then, the sample decision rules are restricted to those with a realized $p \le p^*$. Note that, for any decision rule,

$$P(P \le P^* \mid p \le p^*) \ge 1 - \alpha,$$

assuming the normal approximation to the binomial. An exact value for $p^*$ may also be computed if the requirements for the normal approximation do not hold.

Conversely, in some problems, it may be desirable to fix a minimum $Q$ value and maximize $P$ among decision rules with that value of $Q$ or better.
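A small numerical sketch of the sample cutoff may help. It rests on two assumptions that the text does not state explicitly: that $z_\alpha$ is the lower-tail standard normal quantile (so it is negative for small $\alpha$ and the cutoff falls below $P^*$), and that the standard error is scaled by the sample size $n$. The function name `sample_cutoff` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def sample_cutoff(P_star, n, alpha):
    """Cutoff p* = P* + z_alpha * sqrt(P*(1 - P*) / n).

    Assumption: z_alpha is the lower-tail normal quantile, inv_cdf(alpha),
    which is negative for alpha < 0.5, so p* < P*.
    """
    z_alpha = NormalDist().inv_cdf(alpha)
    return P_star + z_alpha * sqrt(P_star * (1 - P_star) / n)
```

For instance, with an audit capacity of $P^* = 0.10$, a sample of $n = 1000$ and $\alpha = 0.05$, the cutoff is roughly 0.084, so only sample rules selecting at most about 8.4% of cases would be retained under these assumptions.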


2.3.6  Equivalence  of  Objective  Functions 

Interestingly, if $P$ is fixed then the above four objective functions become equivalent. That is,

$$\min_D\,(E + P - 2PQ) \;\equiv\; \max_D\,(PQ) \;\equiv\; \max_D\, Q \;\equiv\; \min_D\,(\text{estimated variance of estimates}).$$

The equivalence does not hold if $P$ is merely constrained to be above a certain value.
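The equivalence, and its failure when $P$ is free, can be seen on a toy set of rules. The candidate $(P, Q)$ values below are hypothetical and chosen only to satisfy the feasibility constraints for $E = 0.2$; the helper names are illustrative.

```python
E = 0.2


def misclassification(P, Q):
    return E + P - 2 * P * Q


def yield_(P, Q):
    return P * Q


# Rules sharing the same P: all objectives pick the same rule.
fixed_p_rules = [(0.25, 0.3), (0.25, 0.5), (0.25, 0.7)]
best_error = min(fixed_p_rules, key=lambda r: misclassification(*r))
best_yield = max(fixed_p_rules, key=lambda r: yield_(*r))
best_q = max(fixed_p_rules, key=lambda r: r[1])

# Rules with different P: the objectives can disagree.
free_p_rules = [(0.10, 0.9), (0.25, 0.7)]
free_best_error = min(free_p_rules, key=lambda r: misclassification(*r))
free_best_q = max(free_p_rules, key=lambda r: r[1])
```

With $P$ fixed at 0.25, all three criteria select $Q = 0.7$. When $P$ varies, maximizing $Q$ selects $(0.10, 0.9)$ while minimizing misclassification selects $(0.25, 0.7)$.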

2.3.7 Minimize Costs of Misclassification

Let $c_{21}$ be the cost of classifying cases from $\Pi_2$ into $\Pi_1$ and $c_{12}$ from $\Pi_1$ into $\Pi_2$. Then, the expected cost of misclassification is

$$E(C) = c_{21}\,P(D_1 \cap \Pi_2) + c_{12}\,P(D_2 \cap \Pi_1)$$
$$= c_{21}\,P(1-Q) + c_{12}\,(E - PQ)$$
$$= P c_{21} + E c_{12} - PQ\,(c_{12} + c_{21}).$$

For the case where $P = E$ and $Q = 1$, that is, the perfect decision rule, we have:

$$E(C) = c_{21}E + c_{12}E - E\,(c_{12} + c_{21}) = 0.$$

For the case where $PQ = E$,

$$E(C) = P c_{21} - E c_{21} = (P - E)\,c_{21}.$$

This is intuitively reasonable since, if $PQ = E$, no $\Pi_1$ cases are classified into $\Pi_2$, so the cost of misclassification depends only on the $\Pi_2$ cases classified into $\Pi_1$.


Upper bounds on the optimal costs of misclassification may be derived as follows: two trivial rules are

$D^{(1)}$: classify all cases into $\Pi_1$

$D^{(2)}$: classify all cases into $\Pi_2$

Under $D^{(1)}$,

$$E_{D^{(1)}}(C) = c_{21}\,P(D_1 \cap \Pi_2) + c_{12}\,P(D_2 \cap \Pi_1) = c_{21}\,P(1-Q) = c_{21}(1-Q) = c_{21}(1-E)$$

since $P = 1$ and $Q = E$.

Under $D^{(2)}$,

$$E_{D^{(2)}}(C) = c_{12}\,(E - PQ) = c_{12}E,$$

since $P = 0$.

Thus,

$$E_{D^*}(C) = \inf_{D \in \mathcal{D}} E_D(C) \le \min\,(c_{21}(1-E),\; c_{12}E).$$
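The expected cost and the bound from the two trivial rules can be written directly from the formulas above. This is an illustrative sketch; the function names and the example cost values are assumptions.

```python
def expected_cost(P, Q, E, c21, c12):
    """E(C) = c21 * P(1 - Q) + c12 * (E - PQ)."""
    return c21 * P * (1 - Q) + c12 * (E - P * Q)


def trivial_upper_bound(E, c21, c12):
    """Cost of the cheaper trivial rule: all cases into Pi_1 or all into Pi_2."""
    return min(c21 * (1 - E), c12 * E)
```

With $E = 0.2$, $c_{21} = 1$ and $c_{12} = 5$, classifying everything into $\Pi_1$ costs 0.8 and classifying everything into $\Pi_2$ costs 1.0, so any rule worth using should have expected cost of at most 0.8; the perfect rule ($P = E$, $Q = 1$) costs zero.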

In the most general context, there may be costs and benefits associated with correct decisions as well as incorrect decisions. Thus, we assume there are weights $w_{ij}$, $i, j = 1, 2$, associated with each entry in the 2 x 2 table. The expected "cost" of classification is, therefore,

$$E(w) = \sum_{i,j} w_{ij} P_{ij}, \quad i, j = 1, 2,$$

where the $P_{ij}$ are as defined in 2.2.

In terms of the parameters $P$, $Q$ and $E$, we have

$$E(w) = w_{11}PQ + w_{12}(E - PQ) + w_{21}P(1-Q) + w_{22}(1 - P - (E - PQ))$$
$$= PQ\,(w_{11} - w_{12} - w_{21} + w_{22}) + E w_{12} + P w_{21} + (1-P)w_{22} - E w_{22}.$$

For the "perfect" screen with $P = E$ and $Q = 1$, we get

$$E(w) = E\,(w_{11} - w_{12} - w_{21} + w_{22}) + E w_{12} + E w_{21} + (1-E)w_{22} - E w_{22}$$
$$= E\,(w_{11} - w_{22}) + w_{22}.$$

Thus, the "perfect" screen does not yield zero cost unless $w_{11} = w_{22} = 0$. The decision rule yields a negative cost if $\frac{w_{22}}{w_{22} - w_{11}} < E$ or if $w_{22} = w_{11} < 0$.
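The general weighted form can be evaluated with the same table entries. The sketch below is illustrative only: the function name and the weight values are hypothetical, with negative weights standing for benefits.

```python
def expected_weight(P, Q, E, w):
    """E(w) = sum of w[i][j] * P_ij, with rows Pi_1, Pi_2 and columns D_1, D_2."""
    p11 = P * Q
    p12 = E - P * Q
    p21 = P * (1 - Q)
    p22 = 1 - P - (E - P * Q)
    return w[0][0] * p11 + w[0][1] * p12 + w[1][0] * p21 + w[1][1] * p22


# Hypothetical weights: benefits (negative) on the diagonal, costs off it.
weights = [[-1.0, 5.0], [2.0, -0.5]]
```

For the perfect screen with $E = 0.2$ these weights give $E(w) = 0.2(-1.0 + 0.5) - 0.5 = -0.6$, in agreement with $E(w_{11} - w_{22}) + w_{22}$. Setting the diagonal weights to zero recovers the misclassification cost $c_{21}P(1-Q) + c_{12}(E - PQ)$.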


2.3.8  Screening  as  a  Hypothesis  Testing  (Decision)  Problem 


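For each new observation $\vec{x}$ we can view the screening problem as a decision between two hypotheses:

$$H_0: \vec{x} \in \Pi_1$$
$$H_A: \vec{x} \in \Pi_2$$

Thus, for a decision rule $D = \langle D_1, D_2 \rangle$, $D_2$ is the rejection region for $H_0$, with

$$\alpha(D) = P(\vec{x} \in D_2 \mid \vec{x} \in \Pi_1) = \frac{E - PQ}{E} = 1 - \frac{PQ}{E} \qquad (1)$$

$$\beta(D) = P(\vec{x} \in D_1 \mid \vec{x} \in \Pi_2) = \frac{P(1-Q)}{1-E} \qquad (2)$$

$$\alpha(D) + \beta(D) = \frac{E - PQ}{E} + \frac{P(1-Q)}{1-E} = \frac{(E - PQ)(1-E) + PE(1-Q)}{E(1-E)}$$

$$= \frac{E - PQ - E^2 + PQE + PE - PQE}{E(1-E)}$$

$$= \frac{E(1-E) - P(Q-E)}{E(1-E)} = 1 - \frac{P(Q-E)}{E(1-E)}. \qquad (3)$$

If $P = E$ and $Q = 1$, $\alpha(D) + \beta(D) = 0$.

In classical hypothesis testing, the aim is to minimize $\beta(D)$ for fixed $\alpha(D) \le \alpha^*$ where $\alpha^*$ is a pre-specified level. Thus, we must minimize (3) subject to (1) less than $\alpha^*$, i.e.,

$$1 - \frac{PQ}{E} \le \alpha^* \;\Rightarrow\; \frac{PQ}{E} \ge 1 - \alpha^* \;\Rightarrow\; PQ \ge E(1 - \alpha^*).$$

But, by (3),

$$\alpha(D) + \beta(D) = 1 - \frac{P(Q-E)}{E(1-E)},$$

which is minimized by maximizing $P(Q-E)$.

Thus, in terms of classical hypothesis testing the problem is:

$$\text{maximize } P(Q-E) \text{ subject to } PQ \ge E(1 - \alpha^*). \qquad (4)$$

If instead, we wish to minimize the sum of $\alpha(D)$ and $\beta(D)$, then by (3) the problem is to

$$\min_D \left[ 1 - \frac{P(Q-E)}{E(1-E)} \right] \;\equiv\; \max_D\, P(Q-E).$$

Note again that, for fixed $P$, the objective function is to maximize $Q$.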
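The error probabilities and identity (3) can be verified numerically. The sketch below uses illustrative function names (`alpha`, `beta`, `satisfies_level`), which are assumptions and not the report's notation.

```python
def alpha(P, Q, E):
    """Type I error: P(x in D_2 | x in Pi_1) = 1 - PQ/E."""
    return 1 - P * Q / E


def beta(P, Q, E):
    """Type II error: P(x in D_1 | x in Pi_2) = P(1 - Q)/(1 - E)."""
    return P * (1 - Q) / (1 - E)


def satisfies_level(P, Q, E, alpha_star):
    """Constraint of problem (4): PQ >= E(1 - alpha*)."""
    return P * Q >= E * (1 - alpha_star)
```

For $(P, Q, E) = (0.25, 0.6, 0.2)$ the two errors are 0.25 and 0.125, and their sum 0.375 equals $1 - P(Q-E)/(E(1-E))$. The rule meets a level of $\alpha^* = 0.30$ but not $\alpha^* = 0.20$.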

2.4  Screening  and  Prediction  Logic 


2.4.1  Background 


Hildebrand, Laing and Rosenthal (1977) consider the use of a variable $X$ taking on $c$ values $x_1, \ldots, x_c$ to predict the state of another variable $Y$ taking on $r$ values $y_1, \ldots, y_r$. They define a degree 1 proposition as being one that makes a prediction on $Y$ for each observation on $X$.

If we consider the decision rule $D = \langle D_1, D_2 \rangle$ as a two-state prediction variable and $\Pi = \langle \Pi_1, \Pi_2 \rangle$ as the two-state dependent variable, then we have a simple prediction logic statement:

If $\vec{x} \in D_1$, predict $\vec{x} \in \Pi_1$, i.e., $D_1 \rightarrow \Pi_1$.
Similarly, $D_2 \rightarrow \Pi_2$.

That is, in Hildebrand et al.'s notation, we define

$$
X = \begin{cases} X_1 & \text{if } D_1 \text{ occurs} \\ X_2 & \text{if } D_2 \text{ occurs} \end{cases}
\qquad
Y = \begin{cases} Y_1 & \text{if } \Pi_1 \text{ is the true state} \\ Y_2 & \text{if } \Pi_2 \text{ is the true state} \end{cases}
$$

with the corresponding 2 x 2 table:

$$
\begin{array}{c|cc|c}
    & X_1    & X_2    &            \\ \hline
Y_1 & P_{11} & P_{12} & P_{1\cdot} \\
Y_2 & P_{21} & P_{22} & P_{2\cdot} \\ \hline
    & P_{\cdot 1} & P_{\cdot 2} & 1
\end{array}
$$

and the prediction logic statement $X_1 \rightarrow Y_1$, $X_2 \rightarrow Y_2$.

In prediction logic problems, rules linking $X$ states to $Y$ states are formulated a priori. In the screening problem, we use the data to identify decision rules (prediction variables), hence the prediction is ex post. Nonetheless, the theory of prediction logic can be used to develop alternative objective functions and methods of evaluating results. This is the subject of the remainder of this chapter.

2.4.2  Proportionate  Reduction  in  Error 

In section 2.3, we discussed some of the possible objective functions for the screening problem. In this section, we introduce an alternative objective function that has properties that make it preferable to the others in some circumstances. We also compare this objective function with some of the well-known measures of association in 2 x 2 tables.

Costner (1965) argues that a measure of association in a contingency table should have what he calls a proportionate reduction in error interpretation using the following four criteria:

1.  The user must define a rule for predicting the dependent variable given the independent variable (this is called rule K).

2.  The user must define a corresponding rule for estimating the dependent variable without the independent variable (called rule U).

3.  The user must define "errors" in prediction.

4.  The measure must be defined in proportionate reduction in error form:

$$\frac{\text{Errors Rule U} - \text{Errors Rule K}}{\text{Errors Rule U}}$$

Thus, a proportionate reduction in error (PRE) measure reflects the proportionate reduction in error achieved by using the information in the independent variable rather than just the information in the dependent variable.

Applying the PRE concept to the screening problem we have:

Rule K: $X_1 \rightarrow Y_1$ and $X_2 \rightarrow Y_2$. Defining an "error" as a misclassified case, the proportion of errors by rule K is simply the total probability of misclassification, $E + P - 2PQ$.

Rule U: In order to correspond to rule K, the rule U must classify the same proportion into $Y_1$ and $Y_2$ as rule K did. Thus, rule U randomly classifies a proportion $P$ of the cases into $Y_1$ and $1-P$ into $Y_2$. The expected proportion of errors by rule U, then, is $P(1-E) + (1-P)E$, or $E + P - 2PE$. Note that this is the probability of misclassification when $Q = E$, i.e., $Q = P(Y_1 \mid X_1) = P(Y_1) = E$, so that $X_1$ provides no information.

Thus, the PRE measure, denoted by $\nabla$, is

\[ \nabla = \frac{(E + P - 2PE) - (E + P - 2PQ)}{E + P - 2PE}
          = \frac{2P(Q - E)}{E + P - 2PE} \tag{1} \]

The corresponding $\nabla$ value for the sample is

\[ \hat{\nabla} = \frac{2p(q - e)}{e + p - 2pe} \tag{2} \]
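As a rough illustration of equation (1), the sketch below computes the PRE measure directly from the Rule U and Rule K error rates and compares it with the closed form. The function name and the parameter values (P = 0.20, Q = 0.60, E = 0.25) are hypothetical and chosen only for the example.

```python
def pre_measure(P: float, Q: float, E: float) -> float:
    """PRE measure: (Rule U errors - Rule K errors) / Rule U errors."""
    errors_u = E + P - 2 * P * E   # random assignment of a proportion P to Y1
    errors_k = E + P - 2 * P * Q   # probability of misclassification under the rule
    return (errors_u - errors_k) / errors_u


# Hypothetical rule: 20% of cases screened into Y1, 60% of those truly Y1,
# and a base rate of 25% for Y1.
P, Q, E = 0.20, 0.60, 0.25
nabla = pre_measure(P, Q, E)
closed_form = 2 * P * (Q - E) / (E + P - 2 * P * E)
print(round(nabla, 4), round(closed_form, 4))
```

Both routes give the same value, which is simply the algebraic simplification shown in (1).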


Again, maximizing $\nabla$, for fixed $P$, is equivalent to
maximizing $Q$.  That is, maximizing $\nabla$ becomes equivalent to the
objective functions discussed in section 2.3.  I now develop
some basic results for this PRE measure.  In all these results
I assume, without loss of generality, that $0 < E < 1/2$.


Theorem 2.4.1:  $\nabla < 0 \iff Q < E$.  Thus, any decision rule with $Q < E$
is inadmissible since rule U outperforms it (or the rule $X_1 \to Y_2$,
$X_2 \to Y_1$).  Furthermore, $\nabla = 0 \iff Q = E$.

Proof:  $\nabla = c\,(Q - E)$,

where $c = \frac{2P}{E + P - 2PE} \ge 0$ since $E < 1/2$.

Thus $\nabla < 0 \iff Q < E$ as required.  The second statement
holds trivially.


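Theorem 2.4.2:  $\nabla \le 1$ with equality $\iff P = E$ and $Q = 1$.

Proof:

\[ \nabla = 1 - \frac{E + P - 2PQ}{E + P - 2PE} \]

\[ \nabla > 1 \implies \frac{E + P - 2PQ}{E + P - 2PE} < 0
   \implies E + P - 2PQ = \text{Probability of misclassification} < 0 \]

which is impossible.  Hence, $\nabla \le 1$.

\[ \nabla = 1 \implies E + P - 2PQ = 0 \]
\[ \implies (E - PQ) + (P - PQ) = 0 \]
\[ \implies E = PQ \text{ and } P = PQ \quad (\text{since } E \ge PQ \text{ and } P \ge PQ) \]
\[ \implies E = P = PQ \]
\[ \implies Q = 1 \]

The converse holds also.  If $Q = 1$ and $P = E$,

\[ \nabla = \frac{2E(1 - E)}{E + E - 2E^2} = 1. \]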


Thus, Theorems 2.4.1 and 2.4.2 show that, for any
admissible decision rule, $0 \le \nabla \le 1$, with $\nabla = 0$ when $Q = E$ (the
decision rule provides no additional information) and $\nabla = 1$ when
$Q = 1$ and $P = E$ (the perfect decision rule).


Theorem 2.4.3:  If $D^{(1)}$ and $D^{(2)}$ are two admissible decision
rules with $(P_1, Q)$ and $(P_2, Q)$ the respective parameters, then
$\nabla_{D^{(1)}} > \nabla_{D^{(2)}} \iff P_1 > P_2$.

Proof:

\[ \nabla_{D^{(1)}} > \nabla_{D^{(2)}}
   \implies \frac{2P_1(Q - E)}{E + P_1(1 - 2E)} > \frac{2P_2(Q - E)}{E + P_2(1 - 2E)} \]
\[ \implies P_1\,[E + P_2(1 - 2E)] > P_2\,[E + P_1(1 - 2E)]
   \quad \text{since } Q > E \text{ and } 0 < E < 1/2 \]
\[ \implies P_1 E + P_1 P_2 - 2P_1 P_2 E > P_2 E + P_1 P_2 - 2P_1 P_2 E \]
\[ \implies P_1 > P_2 \text{ as required.} \]

Reversing the above steps, the converse holds.
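The three theorems above can be spot-checked numerically. The following sketch is only a check at a few hypothetical parameter values (E = 0.25, Q = 0.60, and a handful of P values chosen so that PQ does not exceed E); it is not a substitute for the proofs.

```python
def pre_measure(P: float, Q: float, E: float) -> float:
    """Closed form of the PRE measure, equation (1) of this section."""
    return 2 * P * (Q - E) / (E + P - 2 * P * E)


E, Q = 0.25, 0.60

# Theorem 2.4.1: the sign of the measure follows the sign of Q - E.
below = pre_measure(0.20, 0.10, E)    # Q < E
equal = pre_measure(0.20, E, E)       # Q = E

# Theorem 2.4.2: the upper bound 1 is reached when P = E and Q = 1.
perfect = pre_measure(E, 1.0, E)

# Theorem 2.4.3: for fixed Q > E the measure increases with P.
by_p = [pre_measure(P, Q, E) for P in (0.10, 0.20, 0.30, 0.40)]
print(below, equal, perfect, [round(v, 4) for v in by_p])
```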


Theorem 2.4.4:  If $D = \langle D_1, D_2\rangle$ is a decision rule and $\nabla_{D_1}$ and $\nabla_{D_2}$
are the PRE measures for the components of that decision rule,
then

\[ \nabla_D = \frac{(\text{Rule } U_{D_1})\,\nabla_{D_1} + (\text{Rule } U_{D_2})\,\nabla_{D_2}}{\text{Rule } U_{D_1} + \text{Rule } U_{D_2}} \]

That is, the PRE measure for the decision rule $D$ can be derived
as the weighted average of the PRE measure for each component.

Proof:  $D_1$ is equivalent to the prediction logic statement
$X_1 \to Y_1$.

Rule $K_{D_1}$:  Under $D_1$, errors occur only for cases
$(Y_2, X_1)$, which occur with frequency $P(1-Q)$.

Rule $U_{D_1}$:  Under $D_1$, predictions are made for a proportion
$P$ of the cases with errors occurring
with probability $1-E$, so that rule U errors
are $P(1-E)$.

\[ \therefore \nabla_{D_1} = \frac{P(1-E) - P(1-Q)}{P(1-E)} = \frac{Q - E}{1 - E} \]

$D_2$ is equivalent to the statement $X_2 \to Y_2$.

Rule $K_{D_2}$:  Under $D_2$, errors occur only for the cases
$(Y_1, X_2)$, which occur with frequency $E - PQ$.

Rule $U_{D_2}$:  Under $D_2$, Rule U predictions are made for a
proportion $1 - P$ of the cases with errors
occurring with probability $E$, so Rule U
errors are $(1 - P)E$.

\[ \therefore \nabla_{D_2} = \frac{(1-P)E - (E - PQ)}{(1-P)E} = \frac{P(Q - E)}{(1-P)E} \]

Now

\[ \frac{(\text{Rule } U_{D_1})\,\nabla_{D_1} + (\text{Rule } U_{D_2})\,\nabla_{D_2}}{\text{Rule } U_{D_1} + \text{Rule } U_{D_2}}
   = \frac{P(1-E)\,\frac{Q-E}{1-E} + (1-P)E\,\frac{P(Q-E)}{(1-P)E}}{P(1-E) + (1-P)E} \]
\[ = \frac{2P(Q - E)}{E + P - 2PE} = \nabla_D \text{ as required.} \]
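The weighted-average decomposition of Theorem 2.4.4 can be checked with a few lines of code. This is a minimal sketch under the same hypothetical parameter values used for illustration elsewhere (P = 0.20, Q = 0.60, E = 0.25); the function names are ours.

```python
def component_pre(P: float, Q: float, E: float):
    """Return (Rule U errors, PRE) for the components D1 (X1 -> Y1) and D2 (X2 -> Y2)."""
    u1, k1 = P * (1 - E), P * (1 - Q)        # Rule U and Rule K errors under D1
    u2, k2 = (1 - P) * E, E - P * Q          # Rule U and Rule K errors under D2
    return (u1, (u1 - k1) / u1), (u2, (u2 - k2) / u2)


def combined_pre(P: float, Q: float, E: float) -> float:
    """Weighted average of the component PRE measures, weights = Rule U errors."""
    (u1, nabla1), (u2, nabla2) = component_pre(P, Q, E)
    return (u1 * nabla1 + u2 * nabla2) / (u1 + u2)


P, Q, E = 0.20, 0.60, 0.25
direct = 2 * P * (Q - E) / (E + P - 2 * P * E)
print(round(combined_pre(P, Q, E), 4), round(direct, 4))
```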


2.4.3  Comparison  to  Other  Measures 
2.4.3.1  Guttman's $\lambda$

Consider the following PRE measure:

Rule U:  Predict the most likely value.  Since we
assume $E < 1/2$, this means we predict $Y_2$ ($\Pi_2$)
for every case and make errors with probability $E$.

Rule K:  Predict the most likely $Y$ value given the
$X$ value.  For our situation, this means
$X_1 \to Y_1$ and $X_2 \to Y_2$, with errors = probability
of misclassification = $E + P - 2PQ$.

Then

\[ \nabla_\lambda = \frac{E - (E + P - 2PQ)}{E} = \frac{2PQ - P}{E} \tag{1} \]

Guttman's $\lambda$ is given by

\[ \lambda = \frac{\sum_j M_j - M}{1 - M} \tag{2} \]

where $M_j = \max_i P_{ij}$ and $M = \max_i P_{i\cdot}$.

In our notation,

\[ M_1 = P_{11} = PQ \]
\[ M_2 = P_{22} = (1 - P) - (E - PQ) \]
\[ M = 1 - E \]

\[ \therefore \lambda = \frac{PQ + (1 - P) - (E - PQ) - (1 - E)}{1 - (1 - E)}
   = \frac{PQ + 1 - P + PQ - 1}{E} = \frac{2PQ - P}{E} = \nabla_\lambda \]

Thus, we have demonstrated that Guttman's $\lambda$ has a PRE
interpretation.
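To make the identity concrete, the sketch below builds the 2 x 2 table of cell probabilities from hypothetical values of P, Q and E and computes Guttman's $\lambda$ from definition (2). It assumes $Q \ge 1/2$ and $E < 1/2$, so that the modal cells are $P_{11}$ and $P_{22}$ as in the derivation; the helper names are ours.

```python
def table_from_params(P: float, Q: float, E: float):
    """Cell probabilities; rows are Y1, Y2 and columns are X1, X2."""
    p11 = P * Q
    p21 = P * (1 - Q)
    p12 = E - P * Q
    p22 = (1 - P) - (E - P * Q)
    return [[p11, p12], [p21, p22]]


def guttman_lambda(table) -> float:
    """Lambda = (sum_j max_i P_ij - max_i P_i.) / (1 - max_i P_i.)."""
    sum_col_max = sum(max(table[0][j], table[1][j]) for j in range(2))
    max_row_total = max(sum(table[0]), sum(table[1]))
    return (sum_col_max - max_row_total) / (1 - max_row_total)


P, Q, E = 0.20, 0.60, 0.25
lam = guttman_lambda(table_from_params(P, Q, E))
pre_form = (2 * P * Q - P) / E
print(round(lam, 4), round(pre_form, 4))
```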

Theorem 2.4.5:  $\nabla = 1 \iff \nabla_\lambda = \lambda = 1 \iff Q = 1$ and $P = E$.

Proof:  We have already shown (Theorem 2.4.2) that $\nabla = 1 \iff Q = 1$
and $P = E$.  Therefore we need only show $\lambda = 1 \iff Q = 1$ and $P = E$.

If $Q = 1$ and $P = E$, $\lambda = \frac{2E - E}{E} = 1$.

If $\lambda = 1$, then using (2) we have

\[ \frac{P_{11} + P_{22} - P_{2\cdot}}{1 - P_{2\cdot}} = 1 \]
\[ \implies P_{11} + P_{22} = 1 \]
\[ \implies P_{21} + P_{12} = 0 \]
\[ \implies P_{21} = P_{12} = 0 \]
\[ \implies P(1 - Q) = 0 \text{ and } E - PQ = 0 \]
\[ \implies P = E \text{ and } Q = 1 \text{ as required.} \]

Thus, $\lambda$ shares the property that it equals 1 only when
the decision rule is perfect.

Also, $\lambda = 0 \iff q = 1/2$ and $\lambda < 0 \iff q < 1/2$.

This is consistent with the PRE derivation of $\lambda$ since,
when $q = 1/2$, there is no modal class for $X_1$.  When $q < 1/2$, the modal
class has not been selected given $X_1$, which means that the prediction
should have been $X_1 \to Y_2$, for which $\lambda \ge 0$.  Thus, if the
decision rule is selected according to Rule K,

\[ 0 \le \lambda \le 1. \]

Note also that, for fixed $P$, maximizing $\lambda$ is equivalent to
maximizing $Q$.

2.4.3.2  Goodman and Kruskal's $\tau$

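Consider the following PRE development:

Rule U:  Predict $Y_1$ for a proportion $E$ of the cases and
$Y_2$ for a proportion $1 - E$ of the cases.  The
error rate will be $E(1-E) + (1-E)E = 2E(1-E)$.

Rule K:  Predict $Y_1$ for a proportion $P_{11}/P_{\cdot 1}$ of the
cases and $Y_2$ for a proportion $P_{21}/P_{\cdot 1}$ when $X_1$
occurs; predict $Y_2$ for a proportion $P_{22}/P_{\cdot 2}$
of the cases and $Y_1$ for a proportion $P_{12}/P_{\cdot 2}$
when $X_2$ occurs.

Expected error rate under Rule K is:

\[ P_{\cdot 1}\left(\frac{P_{11}}{P_{\cdot 1}}\frac{P_{21}}{P_{\cdot 1}} + \frac{P_{21}}{P_{\cdot 1}}\frac{P_{11}}{P_{\cdot 1}}\right)
 + P_{\cdot 2}\left(\frac{P_{12}}{P_{\cdot 2}}\frac{P_{22}}{P_{\cdot 2}} + \frac{P_{22}}{P_{\cdot 2}}\frac{P_{12}}{P_{\cdot 2}}\right) \]
\[ = 2\left(\frac{P_{11}P_{21}}{P_{\cdot 1}} + \frac{P_{22}P_{12}}{P_{\cdot 2}}\right) \]
\[ = 2\left(\frac{PQ \cdot P(1-Q)}{P} + \frac{(E - PQ)\,(1 - P - (E - PQ))}{1 - P}\right) \]
\[ = 2\left(PQ(1-Q) + (E - PQ) - \frac{(E - PQ)^2}{1 - P}\right) \]
\[ = 2\,\frac{(-PQ^2 + E)(1 - P) - E^2 + 2EPQ - P^2Q^2}{1 - P} \]
\[ = 2\,\frac{-PQ^2 + E - PE - E^2 + 2EPQ}{1 - P} \]
\[ = 2\,\frac{E(1 - E) - P(Q^2 - 2EQ + E)}{1 - P} \]

The factor 2 is common to the Rule U and Rule K error rates and
cancels in the ratio, so that

\[ \nabla_\tau = 1 - \frac{\dfrac{E(1 - E) - P(Q^2 - 2EQ + E)}{1 - P}}{E(1 - E)} \]
\[ = \frac{(1 - P)E(1 - E) - E(1 - E) + P(Q^2 - 2EQ + E)}{(1 - P)E(1 - E)} \]
\[ = \frac{P(Q - E)^2}{(1 - P)E(1 - E)} \]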




But this is the same as (1).  That is,

\[ \tau = \nabla_\tau = \frac{P(Q - E)^2}{(1 - P)\,E\,(1 - E)} \tag{6} \]

Therefore, $\tau$ has a PRE interpretation.

Note that, if $P = E$ and $Q = 1$,

\[ \tau = \frac{E(1 - E)^2}{(1 - E)\,E\,(1 - E)} = 1. \]

If $Q = E$, $\tau = 0$.

Also, $\tau \ge 0$ since all terms in (7) are greater than zero.
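The closed form for $\tau$ can be compared with the proportional-prediction error rates computed directly from a table. The sketch below is illustrative only: the parameter values are hypothetical, and the function computes $\tau$ from the Rule U and Rule K error rates described in this subsection rather than from any library routine.

```python
def table_from_params(P: float, Q: float, E: float):
    """Cell probabilities; rows are Y1, Y2 and columns are X1, X2."""
    return [[P * Q, E - P * Q],
            [P * (1 - Q), (1 - P) - (E - P * Q)]]


def goodman_kruskal_tau(table) -> float:
    """PRE form of tau under proportional prediction of Y."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    # Rule U: predict Y in proportion to its marginals; error = 1 - sum_i P_i.^2
    error_u = 1 - sum(r * r for r in row_totals)
    # Rule K: predict Y in proportion to the conditional distribution within each column
    error_k = 1 - sum(table[i][j] ** 2 / col_totals[j]
                      for i in range(2) for j in range(2))
    return (error_u - error_k) / error_u


P, Q, E = 0.20, 0.60, 0.25
tau = goodman_kruskal_tau(table_from_params(P, Q, E))
closed_form = P * (Q - E) ** 2 / ((1 - P) * E * (1 - E))
print(round(tau, 4), round(closed_form, 4))
```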

2.4.3.3  Extension to Weighted Errors

Let $W_{12} > 0$ and $W_{21} > 0$ be the weights (costs) associated
with prediction errors.  Then,

Rule K Expected Cost of Errors:

\[ W_{12}P_{12} + W_{21}P_{21} = W_{12}(E - PQ) + W_{21}P(1 - Q) \]

Rule U Expected Cost of Errors:

\[ W_{12}P_{1\cdot}P_{\cdot 2} + W_{21}P_{2\cdot}P_{\cdot 1} = W_{12}E(1 - P) + W_{21}(1 - E)P \]

Then

\[ \nabla_{D,W} = \frac{\text{Rule U} - \text{Rule K}}{\text{Rule U}} \]
\[ = \frac{W_{12}E - W_{12}EP + W_{21}P - W_{21}EP - W_{12}E + W_{12}PQ - W_{21}P + W_{21}PQ}{W_{12}E - W_{12}PE + W_{21}P - W_{21}EP} \]
\[ = \frac{PQ(W_{12} + W_{21}) - PE(W_{12} + W_{21})}{EW_{12} + PW_{21} - PE(W_{12} + W_{21})} \]
\[ = \frac{P(Q - E)(W_{12} + W_{21})}{EW_{12} + PW_{21} - PE(W_{12} + W_{21})} \tag{1} \]


Properties of $\nabla_{D,W}$ include:

1.  For $W_{12} = W_{21} = W$, say, $\nabla_{D,W} = \nabla_D$.

Proof:

\[ \nabla_{D,W} = \frac{P(Q - E)\,2W}{EW + PW - 2PEW} = \frac{2P(Q - E)}{E + P - 2PE} = \nabla_D \]

2.  $\nabla_{D,W}$ is invariant under constant multiplication of
weights:  that is, $\nabla_{D,kW} = \nabla_{D,W}$.

Proof:

\[ \nabla_{D,kW} = \frac{P(Q - E)(kW_{12} + kW_{21})}{EkW_{12} + PkW_{21} - PE(kW_{12} + kW_{21})} = \nabla_{D,W} \]

(However, $\nabla_{D,W}$ is not invariant under addition.)

3.  $\nabla_{D,W} = 0 \iff Q = E$.

Proof:  Obvious from (1).

4.  $\nabla_{D,W} = 1 \iff Q = 1$ and $P = E$.

Proof:  If $Q = 1$ and $P = E$,

\[ \nabla_{D,W} = \frac{E(1 - E)(W_{12} + W_{21})}{EW_{12} + EW_{21} - E^2(W_{12} + W_{21})}
   = \frac{E(1 - E)(W_{12} + W_{21})}{(E - E^2)(W_{12} + W_{21})} = 1. \]

If $\nabla_{D,W} = 1$, then

\[ \frac{\text{Rule U} - \text{Rule K}}{\text{Rule U}} = 1 \implies \frac{\text{Rule K}}{\text{Rule U}} = 0 \]
\[ \implies \text{Rule K} = 0 \implies W_{12}(E - PQ) + W_{21}P(1 - Q) = 0 \]
\[ \implies E - PQ = 0 \text{ and } P(1 - Q) = 0 \quad (\text{since } W_{12}, W_{21} > 0) \]
\[ \implies P = E \text{ and } Q = 1. \]

Thus, $\nabla_{D,W}$ is a straightforward extension of $\nabla_D$ and, as
such, is a preferred objective function for any problem where
there is a difference in costs associated with the two decisions.
Furthermore, by property 2, the user need only specify the
relative magnitude of the costs.
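Properties 1 and 2 of the weighted measure are easy to see numerically. The sketch below computes the weighted PRE from the Rule U and Rule K expected costs; the weights and parameter values are hypothetical and serve only to illustrate the invariance under scaling and the lack of invariance under addition.

```python
def weighted_pre(P: float, Q: float, E: float, w12: float, w21: float) -> float:
    """Weighted PRE: (Rule U expected cost - Rule K expected cost) / Rule U expected cost."""
    cost_k = w12 * (E - P * Q) + w21 * P * (1 - Q)
    cost_u = w12 * E * (1 - P) + w21 * (1 - E) * P
    return (cost_u - cost_k) / cost_u


P, Q, E = 0.20, 0.60, 0.25
equal_weights = weighted_pre(P, Q, E, 1.0, 1.0)     # property 1: equals the unweighted measure
base = weighted_pre(P, Q, E, 1.0, 3.0)
scaled = weighted_pre(P, Q, E, 10.0, 30.0)          # property 2: weights multiplied by 10
shifted = weighted_pre(P, Q, E, 2.0, 4.0)           # weights shifted by +1: value changes
print(round(equal_weights, 4), round(base, 4), round(scaled, 4), round(shifted, 4))
```

Only the ratio of the two weights matters, which is why the user need only specify relative costs.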

2.5  Statistical Inference

In  the  preceding  sections,  we  have  introduced  objective 
functions  that  may  be  relevant  to  the  screening  problem.  However, 
in  practice  we  must  deal  with  sample  estimates  of  the  objective 
function.  Thus,  questions  of  statistical  inference  arise. 

In the material that follows, we consider the PRE measure

\[ \nabla_D = \frac{2P(Q - E)}{E + P - 2PE} \]

and its sample based estimator

\[ \hat{\nabla} = \frac{2p(q - e)}{e + p - 2pe} \]

The  results  are  easily  extended  to  ^  and 

Recall that we are assuming a random sample from the population
$\Pi = (\Pi_1, \Pi_2)$ so that the true marginal probabilities are
unknown for both variables $X$ and $Y$ and neither set of marginal
totals is fixed (the sample is not stratified).  The sample results
may be written as follows:

\[ \begin{array}{c|cc|c}
        & D_1          & D_2          &              \\ \hline
\Pi_1   & p_{11}       & p_{12}       & p_{1\cdot}   \\
\Pi_2   & p_{21}       & p_{22}       & p_{2\cdot}   \\ \hline
        & p_{\cdot 1}  & p_{\cdot 2}  & 1
\end{array} \]

2.5.1  Estimated Variance of $\hat{\nabla}$


In terms of the above table,

\[ \hat{\nabla} = 1 - \frac{p_{12} + p_{21}}{p_{1\cdot}\,p_{\cdot 2} + p_{2\cdot}\,p_{\cdot 1}} \]
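The table form of the estimator can be computed directly from cell counts. The sketch below uses a hypothetical sample of 100 cases (the counts are invented for illustration) and checks that the table form agrees with the form in terms of p, q and e.

```python
def pre_from_counts(n11: int, n12: int, n21: int, n22: int) -> float:
    """Sample PRE from a 2 x 2 table; rows are Pi_1, Pi_2 and columns are D1, D2."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (count / n for count in (n11, n12, n21, n22))
    row1, row2 = p11 + p12, p21 + p22      # p_1. and p_2.
    col1, col2 = p11 + p21, p12 + p22      # p_.1 and p_.2
    return 1 - (p12 + p21) / (row1 * col2 + row2 * col1)


# Hypothetical counts: 12 and 13 cases of Pi_1 in D1 and D2, 8 and 67 cases of Pi_2.
nabla_hat = pre_from_counts(12, 13, 8, 67)

# Same quantity from p = p_.1, q = p_11 / p_.1 and e = p_1.
p, q, e = 0.20, 0.60, 0.25
closed_form = 2 * p * (q - e) / (e + p - 2 * p * e)
print(round(nabla_hat, 4), round(closed_form, 4))
```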

For sufficiently large $n$, each of the $p_{ij}$, $p_{i\cdot}$, $p_{\cdot j}$
follows an approximately normal distribution with means $P_{ij}$, $P_{i\cdot}$,
$P_{\cdot j}$ respectively and variances

\[ \frac{P_{ij}(1 - P_{ij})}{n}, \quad \frac{P_{i\cdot}(1 - P_{i\cdot})}{n}, \quad \frac{P_{\cdot j}(1 - P_{\cdot j})}{n}. \]

Thus, $\hat{\nabla}$ is a well-behaved function of variables that
each asymptotically follow a normal distribution with variance
approaching zero.  Thus $\hat{\nabla}$ itself follows an asymptotic normal
distribution with variance estimated by using the Taylor expansion
for $\hat{\nabla}$, i.e.,

\[ \hat{\nabla} = C + \sum_i \sum_j a_{ij}\,p_{ij} \tag{1} \]

where

\[ a_{ij} = \frac{\partial \nabla}{\partial P_{ij}}. \]


2.5.2  Hypothesis  Tests  and  Confidence  Intervals 


Given $\hat{\nabla}$ and $\mathrm{var}(\hat{\nabla})$, the asymptotic normality of $\hat{\nabla}$ allows
confidence intervals to be formed in the usual way, i.e., the
$100(1 - \alpha)\%$ confidence interval is given by

\[ \hat{\nabla} \pm Z_{\alpha/2}\sqrt{\mathrm{var}(\hat{\nabla})}, \]

where $Z_{\alpha/2}$ is the standard normal deviate for probability $\alpha/2$.

To test the hypothesis $H_0$: $\nabla = 0$ against the one-sided
alternative $\nabla > 0$, we reject $H_0$ if $\hat{\nabla} > Z_\alpha\sqrt{\mathrm{var}(\hat{\nabla})}$,
where $Z_\alpha$ is the standard normal deviate for type one error $= \alpha$.

To compare two independent $\nabla$ values (e.g., $\nabla$ for two
independent samples, or for a before and after comparison), use

\[ Z = \frac{\hat{\nabla}_1 - \hat{\nabla}_2}{\sqrt{\mathrm{var}(\hat{\nabla}_1) + \mathrm{var}(\hat{\nabla}_2)}} \]

and reject $H_0$: $\nabla_1 - \nabla_2 = 0$ if $|Z| \ge Z_{\alpha/2}$.

In the before vs. after comparison, a one-sided test
might be appropriate, i.e.,

$H_0$: $\nabla_1 = \nabla_2$  vs.  $H_a$: $\nabla_1 > \nabla_2$.

Reject $H_0$ if $Z > Z_\alpha$.
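The interval and the tests above translate into a few lines of code once an estimate and its variance are available. The sketch below takes the variance as an input, since it does not attempt to reproduce the variance formula of Section 2.5.1; the numerical values (an estimate of 0.40 with variance 0.01) are hypothetical, and the function names are ours.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(nabla_hat: float, var_hat: float, alpha: float = 0.05):
    """Two-sided 100(1 - alpha)% interval based on the normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var_hat)
    return nabla_hat - half_width, nabla_hat + half_width


def reject_zero_one_sided(nabla_hat: float, var_hat: float, alpha: float = 0.05) -> bool:
    """Reject H0: nabla = 0 in favour of nabla > 0?"""
    return nabla_hat > NormalDist().inv_cdf(1 - alpha) * sqrt(var_hat)


def two_sample_z(nabla1: float, var1: float, nabla2: float, var2: float) -> float:
    """Z statistic for comparing two independent estimates."""
    return (nabla1 - nabla2) / sqrt(var1 + var2)


low, high = confidence_interval(0.40, 0.01)
z = two_sample_z(0.40, 0.01, 0.30, 0.01)
print(round(low, 3), round(high, 3), reject_zero_one_sided(0.40, 0.01), round(z, 3))
```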

Note:  As  usual,  the  normal  approximation  may  be  im¬ 

proved  by  replacing  7  by  7+  =  1  -  ^k  +^l/njor  7-  =  1  - -  1/n.j 

For  example  the  confidence  interval  becomes 


7  -  Za 


'var  (7)  ,  7~  +  Za  ,  /var  (7)  . 

/  2 


To test the validity of the normal approximation the
usual rules apply:

$5 < nk$ and $5 < n(1 - k)$, where $k$ = rule K error rate
= probability of misclassification.

Hildebrand et al (1977) conducted some Monte Carlo
experiments with $\hat{\nabla}$ and $\mathrm{var}(\hat{\nabla})$ with the following general
conclusions:

•  The bias of $\hat{\nabla}$ is small, particularly for $n > 100$.  Bias
seems to depend on the skewness of the marginals.

•  For the case of 2 x 2 tables, they found the bias to
be negative, i.e., $\hat{\nabla}$ is a conservative estimator.

•  $\mathrm{var}(\hat{\nabla})$ appears to be a good approximation, although
generally conservative.

•  $\mathrm{var}(\hat{\nabla})$ is seriously biased, usually negatively, for
small samples, especially when $nk < 5$.  However, the
continuity correction helps adjust for the bias.

2.6  Statistical Inference for Ex Post Analysis

The decision rules $D$ are based on analysis of a sample from
the joint population $\Pi = \langle\Pi_1, \Pi_2\rangle$.  Thus, since the rules are not
selected a priori, there is the danger that rules with a high
value of $\hat{\nabla}$ are fitting the data rather than the underlying
situation.  In particular, high values of $\hat{\nabla}$ will tend to be
optimistically biased, especially for small sample sizes.

In practice, the potential bias may be minimized by the
following:

1)  Choosing only rules $D$ which make sense in the light of
intuitive knowledge about the populations $\Pi_1$ and $\Pi_2$.
That is, subjective knowledge could be used to help
choose between rules with similar values of $\hat{\nabla}$.

2)  Avoiding rules that involve too many variables.  That is,
for two rules $D^{(1)}$ and $D^{(2)}$ with $\hat{\nabla}_{D^{(1)}} \approx \hat{\nabla}_{D^{(2)}}$, choose
the rule with the fewer variables.

More  technical  methods  include: 

1)  Develop the decision rule on one portion of the sample,
then test it on the remainder of the sample.  The test
of statistical difference described in 2.5 could be
used.

2)  Use a jackknife type procedure, wherein $\hat{\nabla}_{-j}$ is the result
when developing the rule on all observations except the
$j$th, with

\[ \hat{\nabla}^* = n\hat{\nabla} - \frac{n - 1}{n}\sum_{j=1}^{n} \hat{\nabla}_{-j} \]

(See Miller, R. G. [1974]:  "The Jackknife - A Review,"
Biometrika, 61, 1-15.)


3)  Develop  the  rule  on  one  sample,  then  test  the  results 
on  a  new  sample  from  the  population.  Although  this  is 
similar  to  1),  there  are  practical  differences  in  this 
approach.  See  Chapter  4  for  an  application. 

4)  Develop  hypothesis  tests  that  take  into  account  the  Ex 
Post  nature  of  the  analysis. 
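The jackknife procedure of item 2) can be sketched as follows. This is a simplified illustration: the estimator passed in here evaluates a fixed rule on each leave-one-out sample, whereas the procedure described above would re-develop the rule each time (the `estimator` argument could wrap such a rule-development step). The sample of 100 cases is hypothetical.

```python
def pre_estimate(pairs) -> float:
    """Sample PRE for the fixed rule X=1 -> Y=1, X=2 -> Y=2 on a list of (x, y) pairs."""
    n = len(pairs)
    p = sum(1 for x, _ in pairs if x == 1) / n
    e = sum(1 for _, y in pairs if y == 1) / n
    errors_k = sum(1 for x, y in pairs if x != y) / n
    return 1 - errors_k / (e + p - 2 * p * e)


def jackknife(estimator, observations) -> float:
    """Jackknife estimate: n * full - ((n - 1) / n) * sum of leave-one-out estimates."""
    n = len(observations)
    full = estimator(observations)
    leave_one_out = [estimator(observations[:j] + observations[j + 1:]) for j in range(n)]
    return n * full - (n - 1) / n * sum(leave_one_out)


# Hypothetical sample: counts 12, 13, 8, 67 for (x, y) = (1, 1), (2, 1), (1, 2), (2, 2).
sample = [(1, 1)] * 12 + [(2, 1)] * 13 + [(1, 2)] * 8 + [(2, 2)] * 67
print(round(pre_estimate(sample), 4), round(jackknife(pre_estimate, sample), 4))
```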

In this section we consider approach 4).  To do this, we
make use of some results of Hildebrand et al.  Specifically,
they suggest that, for Ex Post analysis, the correct hypothesis
is

$H_0$: No $\nabla_D > 0$  vs.  $H_a$: Some $\nabla_D > 0$.

The hypothesis $H_0$ is highly restrictive in the context of
screening since it implies statistical independence between the
dependent variables and any combination of the independent
variables.

To develop the test statistic, the variance of $\hat{\nabla}$ must be
computed under the assumption of statistical independence.
Hildebrand et al (p. 223) derive this variance as:


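\[ \mathrm{var}_0(\hat{\nabla}) = \frac{1}{(n-1)U^2}\left\{\sum_i\sum_j W_{ij}^2 P_{i\cdot}P_{\cdot j} - \sum_i \pi_{i\cdot}^2 P_{i\cdot} - \sum_j \pi_{\cdot j}^2 P_{\cdot j} + U^2\right\} \tag{1} \]

where $\pi_{i\cdot} = \sum_j W_{ij}P_{\cdot j}$, $\pi_{\cdot j} = \sum_i W_{ij}P_{i\cdot}$,
and $U$ is the Rule U error rate.

For the case $W_{11} = W_{22} = 0$; $W_{12} = W_{21} = 1$, the expression
can be simplified considerably:

\[ \pi_{1\cdot} = P_{\cdot 2}, \quad \pi_{2\cdot} = P_{\cdot 1}, \quad \pi_{\cdot 1} = P_{2\cdot}, \quad \pi_{\cdot 2} = P_{1\cdot} \]

Thus (1) becomes

\[ \frac{1}{(n-1)U^2}\left\{P_{1\cdot}P_{\cdot 2} + P_{2\cdot}P_{\cdot 1} - (1-P)^2P_{1\cdot} - P^2P_{2\cdot} - (1-E)^2P_{\cdot 1} - E^2P_{\cdot 2} + U^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{E(1-P) + (1-E)P - (1-P)^2E - P^2(1-E) - (1-E)^2P - E^2(1-P) + U^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{E - EP + P - PE - E + 2PE - P^2E - P^2 + P^2E - P + 2EP - E^2P - E^2 + PE^2 + U^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{-P^2 + 2EP - E^2 + U^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{U^2 - (P - E)^2\right\} \]
\[ = \frac{1}{n-1}\left\{1 - \left(\frac{P - E}{U}\right)^2\right\} \tag{2} \]

Another expression for the variance can be developed as
follows:

\[ \text{Variance} = \frac{1}{n-1}\left\{1 - \left(\frac{P - E}{U}\right)^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{U^2 - (P - E)^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{(E + P - 2PE)^2 - (P - E)^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{E^2 + P^2 + 4P^2E^2 + 2EP - 4P^2E - 4PE^2 - P^2 + 2EP - E^2\right\} \]
\[ = \frac{1}{(n-1)U^2}\left\{4P^2E^2 + 4EP - 4P^2E - 4PE^2\right\} \]
\[ = \frac{4}{(n-1)U^2}\left\{EP\,(PE + 1 - P - E)\right\} \]
\[ = \frac{4}{(n-1)U^2}\left\{EP\,(1 - E - P(1 - E))\right\} \]
\[ = \frac{4}{(n-1)U^2}\left\{E(1 - E)\,P(1 - P)\right\} \]
\[ = \frac{4}{(n-1)U^2}\left\{\text{product of marginals}\right\} \tag{3} \]

Note that the variance expression depends only on $P$ and $E$.
This is not surprising, of course, since under the hypothesis of
independence, $Q = E$.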

The test statistic is, then,

\[ Z^2 = \frac{\hat{\nabla}^2}{\mathrm{var}_0(\hat{\nabla})} \tag{4} \]

where $\mathrm{var}_0(\hat{\nabla})$ is expression (2) or (3) with $p$ and $e$ replacing $P$
and $E$.  Hildebrand et al show that $Z^2$ is approximately $\chi^2$ with
$(R-1)(C-1)$ degrees of freedom.  As discussed previously, however,
the degrees of freedom for the screening problem with $k$
variables $X_i$, each taking on a number of values, will be large and, in
nearly every application, greater than 30.  Therefore, we reject
$H_0$ if

\[ Z^2 > \chi^2(\alpha) \]

where $\alpha$ is the specified type 1 error.

This test is highly conservative in that it would reject
the hypothesis of independence much less frequently than warranted.
For instance, if we assumed the decision rule had been
selected a priori, a standard $\chi^2$ test for independence in a 2 x 2
table would have been used with only 1 degree of freedom.  For
the Chi-square distribution, $\chi^2_m(\alpha) > \chi^2_1(\alpha)$ for any integer $m > 1$.
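The equivalence of expressions (2) and (3), and the resulting test statistic, can be checked numerically. In the sketch below the sample values (p = 0.20, e = 0.25, n = 100, and an estimate of 0.40) are hypothetical; the test statistic is taken to be the squared estimate divided by the variance under independence, and the chi-square critical value is left to the user because it depends on the degrees of freedom of the particular screening problem.

```python
def var_independence(p: float, e: float, n: int) -> float:
    """Expression (2): (1 / (n - 1)) * (1 - ((p - e) / u)^2), u = Rule U error rate."""
    u = e + p - 2 * p * e
    return (1 - ((p - e) / u) ** 2) / (n - 1)


def var_independence_marginals(p: float, e: float, n: int) -> float:
    """Expression (3): 4 * e(1 - e) * p(1 - p) / ((n - 1) * u^2)."""
    u = e + p - 2 * p * e
    return 4 * e * (1 - e) * p * (1 - p) / ((n - 1) * u ** 2)


def z_squared(nabla_hat: float, p: float, e: float, n: int) -> float:
    """Test statistic: squared estimate over its variance under independence."""
    return nabla_hat ** 2 / var_independence(p, e, n)


p, e, n = 0.20, 0.25, 100
v2 = var_independence(p, e, n)
v3 = var_independence_marginals(p, e, n)
z2 = z_squared(0.40, p, e, n)
print(round(v2, 6), round(v3, 6), round(z2, 2))
```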

2.7  Summary

In  this  Chapter,  we  have  shown  the  following: 

•  Any decision rule for screening can be characterized
by the parameters P, Q and E.

•  There are a number of objective functions, each
of which can be described in terms of the above
parameters, that can be considered.  However,
a Proportionate Reduction in Error objective
function has intuitive appeal.

•  The screening problem can be described as an
Ex Post search for a decision rule that makes
good predictions in the context of prediction
logic.

•  Confidence  intervals  and  hypothesis  tests  for 
the  PRE  measure  are  developed,  including  a 
conservative  test  for  the  significance  of  a 
decision  rule  selected  Ex  Post. 


However, we have not yet discussed some of the techniques
for  developing  decision  rules  based  on  sample  data.  In  the 
following  Chapter,  we  review  some  of  the  well  known  techniques 
and,  in  so  doing,  provide  a  rationale  for  the  new  techniques 
discussed  in  Chapters  4  and  5. 




In this Chapter, we review some of the well known statistical
screening methodologies.  The purpose of this review is to
identify some of the deficiencies associated with these techniques,
especially for practical problems dealing with qualitative
variables.  Based on this review, we define five general
properties that a desirable screening technique should possess.
These properties are then used to motivate the new statistical
screening techniques discussed in Chapters IV and V.

The  approaches  considered  in  this  Chapter  are: 

•  Linear  Discriminant  Analysis 

•  Regression  Analysis  Approach 

•  Logit  and  Probit  Analysis 

•  Multinomial  Models 

•  Automatic  Interaction  Detection 

3.1  Linear Discriminant Function

Fisher's Linear Discriminant Function (LDF), developed in
1935, is both the foundation of and the most prevalent of the statistical
screening techniques.  For this reason, I will use the LDF as
the basis for establishing properties that a statistical screening
technique should have, particularly when dealing with
qualitative variables.

In terms of decision rules $D = \langle D_1, D_2\rangle$, the optimal partition
to minimize the probability of misclassification for the
case where $\Pi_1$ and $\Pi_2$ are multivariate normal $N(\mu_1, \Sigma)$, $N(\mu_2, \Sigma)$
with prior probabilities $E$ and $1-E$ respectively is

\[ D_1^* = \left\{x : (\mu_1 - \mu_2)'\,\Sigma^{-1}\left(x - \tfrac{1}{2}(\mu_1 + \mu_2)\right) \ge \ln\frac{1-E}{E}\right\} \tag{1} \]
\[ D_2^* = \overline{D_1^*} \]

Replacing $\mu_1$, $\mu_2$ and $\Sigma$ by the usual MLE, we obtain the
classification rule:

\[ D_1^* = \left\{x : (\bar{x}_1 - \bar{x}_2)'\,S^{-1}\left(x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right) \ge \ln\frac{1-E}{E}\right\} \tag{2} \]

Although (1) is optimal when the assumptions of normality
and equal covariance hold, optimality cannot be claimed when
there are departures from the assumptions.  In particular, the
assumption of equal covariance matrices tends to be restrictive.
In such cases, a quadratic discriminant function arises instead
of the linear function of (1).  For qualitative variables, the
assumption of multivariate normality will be violated except for
large samples.

Goldstein and Dillon (1977) present an example that demonstrates
the inappropriateness of the LDF for qualitative
variables:

\[ X_1 = \begin{cases} 0 & \text{if birth weight is low} \\ 1 & \text{if birth weight is high} \end{cases} \]

\[ X_2 = \begin{cases} 0 & \text{if gestation length is short} \\ 1 & \text{if gestation length is long} \end{cases} \]

Normal babies have high birth weight and long gestation
length or low birth weight and short gestation length.  Abnormal
babies have either of the other two combinations ((0, 1) or
(1, 0)).


The LDF decision rule is of the form:

if $B_1X_1 + B_2X_2 \ge c$, classify in $\Pi_1$ (normal group)
if $B_1X_1 + B_2X_2 < c$, classify in $\Pi_2$ (abnormal group).

For the rule to classify correctly, we must have
$B_1X_1 + B_2X_2 \ge c$ for (0, 0) and (1, 1) and
$B_1X_1 + B_2X_2 < c$ for (0, 1) and (1, 0).

Thus:

      $0 \ge c$ for (0, 0)
      $B_1 + B_2 \ge c$ for (1, 1)
but   $B_2 < c$ for (0, 1)
and   $B_1 < c$ for (1, 0)

$\therefore B_1 + B_2 < 2c \le c \le 0$, unless $c = 0 \implies B_1 = B_2 = 0$.

This result can be generalized by considering the likelihood
function

\[ L(x) = B_0 + \sum_{j=1}^{p} B_j X_j \tag{1} \]

Now consider any pair $(X_1, X_2)$, say.  From (1),

\[ L(1,1,X_3,\ldots,X_p) = L(0,1,X_3,\ldots,X_p) + L(1,0,X_3,\ldots,X_p) - L(0,0,X_3,\ldots,X_p) \]

or

\[ L(1,1) = L(0,1) + L(1,0) - L(0,0) \tag{2} \]

If $L(0,0) < \min(L(0,1), L(1,0))$,
then $L(1,1) > L(0,1) + L(1,0) - \min(L(0,1), L(1,0)) = \max(L(0,1), L(1,0))$.

Thus, if $(0,0) \in \Pi_1$ and $(0,1), (1,0) \in \Pi_2$, then $(1,1)$ could not be
classified into $\Pi_1$.
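The birth weight example can be illustrated with a brute-force search over linear rules. The sketch below is a demonstration on a finite grid of coefficients (the grid range and step are arbitrary choices of ours), not a proof; it looks for any $(B_1, B_2, c)$ that sends exactly the cells (0, 0) and (1, 1) to the normal group.

```python
from itertools import product

CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def classifies_all(b1: float, b2: float, c: float, normal_cells) -> bool:
    """True if the rule 'b1*x1 + b2*x2 >= c -> normal' is correct on all four cells."""
    return all(((b1 * x1 + b2 * x2 >= c) == ((x1, x2) in normal_cells))
               for x1, x2 in CELLS)


grid = [k / 4 for k in range(-20, 21)]          # coefficients from -5 to 5 in steps of 0.25

# Normal group as in the birth weight example: (0, 0) and (1, 1).
example_cells = {(0, 0), (1, 1)}
solutions = [t for t in product(grid, repeat=3) if classifies_all(*t, example_cells)]

# For contrast, a labelling with only (1, 1) in the normal group is linearly separable.
separable = classifies_all(1.0, 1.0, 2.0, {(1, 1)})
print(len(solutions), separable)
```

No coefficient triple on the grid classifies all four cells correctly, in line with the inequalities above.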

For qualitative variables where the categories have no
particular order, linear combinations of variables may be meaningless.
Although the LDF need not perform poorly, there is
always the danger it could do so in examples of the above type.
Clearly, a random reordering of the categories of the variables
could significantly affect the LDF and its success in classification.
For example, by redefining $X_1$ so that

\[ X_1 = \begin{cases} 1 & \text{if birth weight is low} \\ 0 & \text{if birth weight is high} \end{cases} \]

while keeping the same definition for $X_2$, it is no longer impossible
to classify all possibilities correctly.

Based on this observation, I suggest that a desirable
property of a statistical method for screening with qualitative
variables is:

Property 1:  The statistical method should not be affected by
random reordering of categories of each variable.



The LDF is also restricted in terms of the objective functions
it can consider.  That is, it is set up to minimize the
total probability (or costs) of misclassification.  The LDF
cannot be used to maximize $\nabla$ which, as we have seen, is a better
criterion when predictability is the major concern.  Also, the
LDF is not amenable to minimizing the probability of misclassification
subject to fixed resources, i.e., $P \le P^*$, although users
often apply it in this way.  We contend, however, that this procedure
involves a potentially erroneous assumption, as discussed
below.



In many practical applications, the LDF is developed from
the sample without any constraints.  Then, the function is
applied to each observation to yield a "score".  The observations
may be ranked from highest to lowest based on this score, i.e.,
$x_{(1)}, \ldots, x_{(n)}$ with scores $c_1$ to $c_n$.  If $P^*$ is the desired $P$
value, a cutoff score is found such that there are at most $nP^*$
values above the score and at least $n(1 - P^*)$ values below it.
For example, if $n = 100$ and $P^* = 0.2$, $c^*$ should be chosen so that
$c_{20} > c^* > c_{21}$, e.g.,

\[ c^* = \frac{c_{20} + c_{21}}{2} \]

Although this appears to be a reasonable procedure, it may
not be optimal.  This is because it involves the implicit assumption
that there is a monotonic relationship between the score
and the probability of membership in $\Pi_1$, i.e., for $c^* > c$,

\[ P(\Pi_1 \mid c(x) > c^*) \ge P(\Pi_1 \mid c(x) > c), \]

where $c(x)$ denotes the score of $x$ on the LDF.  Put another way,
the assumption is, for $c^* > c$,

if $D_1^* = \{x \mid c(x) > c^*\}$ and
   $D_1 = \{x \mid c(x) > c\}$,

then $Q_{D_1^*} \ge Q_{D_1}$, where $Q_D = P(\Pi_1 \mid x \in D)$.  The assumption may be
criticized on two counts:

1)  For qualitative variables, linear combinations of
variable values have no real meaning, so that a higher
score on the LDF need not imply greater likelihood of
membership in $\Pi_1$ (the example of birth weight is one
such case).

2)  Even if the assumption holds, the region $D_1^*$ may not be
optimal among regions for which $P(D_1) \le P^*$.

Note that the $P^*$ value could also be achieved via a second
procedure by randomly allocating $P^*/P$ (assuming $P^* < P$) of the observations
$x$ for which $c(x) > c$ into $\Pi_1$.  This procedure would yield
the same value of $Q$.  As we have seen, for fixed $Q$, $\nabla_D > \nabla_{D^*}$ if
$P > P^*$.  More generally, there is a tradeoff between $P$ and $Q$ so
that if $P$ decreases, $Q$ should increase and vice versa.  A procedure
that reduces $P$ while fixing $Q$ cannot be considered appropriate.
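The cutoff procedure described above amounts to ranking the scores and taking the midpoint between the last score admitted and the first score excluded. The sketch below is a minimal illustration that assumes distinct scores; the function name and the scores themselves are hypothetical.

```python
import math


def cutoff_score(scores, p_star: float) -> float:
    """Midpoint cutoff leaving at most n * P* scores above it (assumes distinct scores)."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    m = math.floor(p_star * n + 1e-9)        # number of cases allowed above the cutoff
    if not 0 < m < n:
        raise ValueError("P* must leave at least one case on each side of the cutoff")
    return (ranked[m - 1] + ranked[m]) / 2


# Hypothetical LDF scores for n = 100 observations.
scores = [float(s) for s in range(1, 101)]
c_star = cutoff_score(scores, 0.2)
selected = [s for s in scores if s > c_star]
print(c_star, len(selected))
```

The procedure fixes P but, as the text argues, it says nothing about whether the resulting Q is the best attainable for that P.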

In sum, the LDF approach does not lend itself to "control"
of the decision rule parameters $P$ and $Q$.  In practical problems,
this control may be quite important.  That is, the user should
be able to specify the objective function appropriate to the
problem before selecting a screening technique.  This leads to
a second property of a screening method:

Property 2:  The statistical screening technique should be
flexible with respect to the objective function and constraints
that can be handled.

The  following  objection  to  the  LDF  is  a  practical  one. 

After  the  LDF  has  been  constructed,  the  user  must  apply  it  to 
each  new  case  to  ascertain  probable  population  membership.  In 
the  absence  of  computer  support,  it  may  be  difficult  to  compute 
the  score,  especially  if  a  large  number  of  variables  are  involved. 
In cases where quadratic discrimination must be used, the objection
is even more relevant.  Also, the ultimate users may be
concerned about the meaning of the output, e.g., they may wonder
why a person's age category is multiplied by 2, and added to
marital status multiplied by 3, etc.  Getting the user to implement
such an output may be problematic.  Thus, a third property
of a screening technique is:

Property 3:  The form of the final output of a statistical
screening technique should be meaningful for and amenable to
practical use in screening new cases.


The LDF is constructed by using the MLE's of $\mu_1$, $\mu_2$ and $\Sigma$.
For small sample sizes and a relatively large number of variables,
the  accuracy  of  these  estimates  may  be  poor.  For  example, 
Goldstein  and  Dillon  (1977,  pp.  7-10)  compare  the  LDF  constructed 
from  70%  of  a  sample  (325  cases)  with  the  LDF  constructed 
from  the  remaining  30%  (130  cases).  They  found  a  great  deal  of 
instability  in  the  relative  rankings  of  variables,  changes  in 
coefficients  from  positive  to  negative,  and  coefficient  signs  that 
were  not  expected  based  on  prior  knowledge.  The  authors  conclude 
that  the  LDF  is  highly  sensitive  to  sample  size,  particularly 
when  the  variables  are  qualitative  in  nature  as  they  were  for 
this  example. 

Sample  size  is  even  more  of  a  factor  if  there  are  departures 
from  normality.  Both  Moore  (1973)  and  Gessaman  and  Gessaman 
(1977),  in  comparing  various  discriminant  analysis  procedures, 
found  that  the  LDF  performed  uniformly  worse  than  the  other 
procedures  for  instances  where  the  assumptions  did  not  hold  and 
where  the  sample  size  was  too  small  for  asymptotic  normality. 

In comparisons carried out by Gilbert (1968), the LDF performed
quite favorably.  However, Moore claimed that Gilbert's results
were unrealistic since she assumed an underlying linear model for
which the problems of category ordering ("reversals" in the likelihood
function as shown in the birth weight example) would not
occur.  Thus, the LDF should be restricted to instances where the
sample size is large (a rule of thumb is that there should be at
least 50 observations per variable).

Another desirable property of a statistical screening technique
is that it should work well with the small sample size
often encountered in practice:

Property 4:  The statistical screening technique should be
applicable for a wide range of sample sizes.


Several authors have addressed the issue of bias in the
apparent error rate (sample probability of misclassification) in
discriminant analysis.  As discussed before, the expected apparent
error rate is a negatively biased estimate of the expected
actual error rate.  However, this result is due to the Ex Post
nature  of  the  analysis  and,  as  such,  will  tend  to  occur  for  any 
technique  that  develops  decision  rules  based  on  sample  data. 

Thus,  this  criticism  cannot  be  leveled  at  the  LDF  alone. 

Finally,  the  LDF  is,  by  construction,  an  additive  model. 
Thus,  the  approach  may  work  poorly  when  there  is  an  interaction 
effect  among  variables.  The  following  example  illustrates  the 
problem: 

Consider two binary variables $X_1$ and $X_2$ with the following
sample results:

                 $X_1 = 0$   $X_1 = 1$   Total
    $\Pi_1$         100         100        200
    $\Pi_2$         500         500       1000
    Total           600         600       1200

                 $X_2 = 0$   $X_2 = 1$   Total
    $\Pi_1$         100         100        200
    $\Pi_2$         500         500       1000
    Total           600         600       1200

    $(X_1, X_2)$   (0,0)   (0,1)   (1,0)   (1,1)   Total
    $\Pi_1$           0     100     100       0     200
    $\Pi_2$         500       0       0     500    1000
    Total           500     100     100     500    1200

Note that

$$P(\Pi_1 \mid X_1) = P(\Pi_1 \mid X_2) = P(\Pi_1)$$
$$P(\Pi_2 \mid X_1) = P(\Pi_2 \mid X_2) = P(\Pi_2)$$

Thus, $X_1$ and $X_2$ are both independent of
$\Pi = \langle \Pi_1, \Pi_2 \rangle$. That is, by themselves $X_1$
and $X_2$ would not be selected as predictors. However, if we
choose $D = \langle D_1, D_2 \rangle$ such that

$$D_1 = \{(0,0), (1,1)\}$$
$$D_2 = \{(0,1), (1,0)\}$$

then the probability of misclassification is zero. As shown
earlier, however, the LDF cannot produce decision rule $D$.
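The interaction example can be checked numerically. The following
Python sketch is illustrative only (the names are ours, not part of
the original analysis): it takes the joint counts of the example,
confirms that each variable taken alone leaves the class proportions
unchanged, and counts the sample cases misclassified by a rule that
assigns every cell $(X_1, X_2)$ to its majority class.

```python
from fractions import Fraction

# Joint sample counts from the example: (x1, x2) -> (cases in Pi_1, cases in Pi_2)
counts = {(0, 0): (0, 500), (0, 1): (100, 0), (1, 0): (100, 0), (1, 1): (0, 500)}

n1 = sum(c[0] for c in counts.values())
n2 = sum(c[1] for c in counts.values())
base_rate = Fraction(n1, n1 + n2)  # P(Pi_1)

def conditional_rate(position, value):
    """P(Pi_1 | X_position = value), computed from the marginal counts."""
    in_1 = sum(c[0] for cell, c in counts.items() if cell[position] == value)
    in_2 = sum(c[1] for cell, c in counts.items() if cell[position] == value)
    return Fraction(in_1, in_1 + in_2)

marginal_rates = [conditional_rate(pos, val) for pos in (0, 1) for val in (0, 1)]

# Cell-by-cell rule: assign every (x1, x2) cell to its majority class,
# so the minority count in each cell is the number misclassified.
misclassified = sum(min(c) for c in counts.values())

print(base_rate, marginal_rates, misclassified)
```

Each single-variable conditional rate equals the base rate, while the
rule defined on the joint cells separates the two classes completely.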

In some applications, users attempt to get around this
problem by introducing new variables $X_1X_2$, $X_1X_3$,
$X_1X_2X_3$, etc. However, this quickly increases the number of
coefficients to be estimated, particularly if the "fully
saturated" model with all possible interactions is considered.
Decisions about which interactions to include (e.g., only
pairwise) are often made a priori with a resultant loss of
information. In sum, approaches that can handle combined effects
of variables should be preferred, as stated in property 5:

Property 5: The screening technique should be able to handle
interaction effects among variables.


In summary, we have shown that the LDF approach may be
inappropriate for screening with qualitative variables, and we
have defined five properties that a screening technique should
have. In the remaining sections of this Chapter, we review some
of the other well known techniques for screening and critique
them with respect to the five properties.


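3.2 Multiple Regression Analysis

Multiple regression analysis may be used for screening by
defining the dependent variable

$$Y = \begin{cases} 1 & \text{if } x \in \Pi_1 \\ 0 & \text{if } x \in \Pi_2 \end{cases}$$

and estimating the equation

$$Y = B_0 + \sum_i B_i X_i \qquad (1)$$

$$\hat{Y} = b_0 + \sum_i b_i X_i \qquad (2)$$

in the usual way.

The decision rule $D = \langle D_1, D_2 \rangle$ is given by

$$D_1 = \{x \mid \hat{Y} \ge 0.5\}$$
$$D_2 = \{x \mid \hat{Y} < 0.5\}$$

The estimates $\hat{Y}$ are interpreted as the probability that
$x \in \Pi_1$.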

Because  the  regression  equation  is  of  a  similar  form  to  the 
LDF,  this  approach  shares  the  problems  associated  with  the  LDF: 

Property 1: The regression equation is affected by the ordering
of categories unless all variable values are converted to
indicator (dummy) variables. For example, if $X_1$ takes on four
values $x_{11}$, $x_{12}$, $x_{13}$, $x_{14}$, new variables
$X_{11}$, $X_{12}$ and $X_{13}$ are entered, where

$$X_{1j} = \begin{cases} 1 & \text{if } X_1 = x_{1j} \\ 0 & \text{otherwise.} \end{cases}$$

However, this greatly increases the parameter space; e.g., if
there are 10 variables each taking on 5 values, there are 41
coefficients to estimate in equation (1): four indicator
variables for each of the 10 variables, plus the constant term.
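The coefficient count for indicator coding is easy to reproduce. The
short Python sketch below is an illustration under our own naming
(`dummy_coefficient_count` and `indicator_coding` are hypothetical
helpers); it assumes that the last category of each variable is the
omitted reference level.

```python
def dummy_coefficient_count(levels_per_variable):
    """Constant term plus (s - 1) indicator coefficients per variable with s levels."""
    return 1 + sum(s - 1 for s in levels_per_variable)

def indicator_coding(value, categories):
    """Indicator (dummy) coding; the last category is the omitted reference level."""
    return [1 if value == c else 0 for c in categories[:-1]]

# Ten variables with five categories each, as in the text.
n_coefficients = dummy_coefficient_count([5] * 10)

# A four-category variable is represented by three indicator variables.
coded = indicator_coding("b", ["a", "b", "c", "d"])

print(n_coefficients, coded)
```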


Property 2: The regression approach minimizes the sum of squares,
i.e., $\sum (Y - \hat{Y})^2$. As shown in Chapter II, this is
equivalent to minimizing the sample probability of
misclassification. The approach does not handle the objective
function $V$. Also, if the result is constrained to have
$P \le P^*$, users of the regression approach merely shift the
cut-off probability. Again, this assumes a monotonic relationship
between probability as measured by $\hat{Y}$ and the actual
likelihood of membership in $\Pi_1$. This assumption need not
hold in general and will rarely hold for variables measured on a
nominal scale.

Property  3:  As  with  the  LDF,  the  regression  equation  value  can 
be  difficult  to  compute  for  new  cases  and  the  equation  itself 
(and  coefficients)  may  have  no  meaning  in  the  context  of  the 
problem. 

Property 4: The regression equation requires fairly large sample
sizes if the number of variables under consideration is high, and
particularly if all the variables are converted to dummy
variables as shown above. Also, the equation is sensitive to high
correlation among variables, creating the problem of
multicollinearity wherein the standard error of coefficient
estimates may be very high.

Property 5: The regression equation can handle interaction
effects but with the same problems of parameter proliferation as
discussed for the LDF.

In addition to the above, there are some specific problems
associated with the regression analysis approach for a binary
dependent variable; see, for example, Hanushek and Jackson
(1977). First, the assumption of homoscedasticity of error terms
is violated, as shown below.

If we rewrite (1) in matrix form as

$$Y_t = X_t B + e_t \qquad (3)$$

then, since $Y_t = 0$ or $1$, we have

$$e_t = -X_t B \quad \text{or} \quad 1 - X_t B.$$

Since $E(e_t) = 0$, we must have

$$\begin{aligned}
E(e_t) &= -X_t B \, P(e_t = -X_t B) + (1 - X_t B)\, P(e_t = 1 - X_t B) \\
       &= -X_t B \,\bigl(1 - P(e_t = 1 - X_t B)\bigr) + (1 - X_t B)\, P(e_t = 1 - X_t B) \\
       &= -X_t B + P(e_t = 1 - X_t B) \\
       &= 0
\end{aligned}$$

so that $P(e_t = 1 - X_t B) = X_t B$ and
$P(e_t = -X_t B) = 1 - X_t B$,

but $\operatorname{Var}(e_t) = E(e_t^2) = X_t B \,(1 - X_t B)$,

which depends upon the observations.
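The variance expression follows in one step from the two-point
distribution of $e_t$ derived above. Written out as a check, using
only the probabilities already obtained:

```latex
\begin{aligned}
\operatorname{Var}(e_t) = E(e_t^2)
  &= (-X_t B)^2 \,(1 - X_t B) + (1 - X_t B)^2 \, X_t B \\
  &= X_t B \,(1 - X_t B)\,\bigl[\, X_t B + (1 - X_t B) \,\bigr] \\
  &= X_t B \,(1 - X_t B).
\end{aligned}
```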


Secondly, the regression equation (2) can yield estimates
outside the range (0, 1), which is inconsistent with the
probability interpretation attached to the estimates. Also, by
the discussion above, the estimated variance is
$\hat{Y}_t (1 - \hat{Y}_t)$, which is negative for values outside
the range (0, 1). Some users treat all estimated values less
than 0 as 0, and values above 1 as 1. This yields an estimated
variance of zero which is, clearly, a negatively biased estimate.
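A small numerical illustration of this second problem is sketched
below in Python. The data are hypothetical and the single-regressor
least-squares helper `ols_line` is our own; the point is only that a
straight line fitted to a 0/1 dependent variable can produce
estimates below 0 and above 1, for which $\hat{Y}(1 - \hat{Y})$ is
negative.

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: Y = 1 for cases from Pi_1, 0 for cases from Pi_2.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = ols_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]

# Variance estimate Y_hat * (1 - Y_hat) for each case.
variance_estimates = [y_hat * (1 - y_hat) for y_hat in fitted]

for x, y_hat, v in zip(xs, fitted, variance_estimates):
    print(x, round(y_hat, 3), round(v, 3))
```

For these data the fitted values at the two extremes of $X$ fall
outside (0, 1), and the corresponding variance estimates are
negative.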


In  sum,  the  regression  approach  suffers  from  the  same 
problems  as  the  LDF  with  respect  to  the  five  properties  and  has, 
in  addition,  important  technical  problems  when  dealing  with  a 
binary  dependent  variable. 


3.3 Logit/Probit Analysis

Logit and probit analysis involve the specification of a
functional form for the probabilities of class membership. In
particular, the logistic function is

$$P(\Pi_1) = \frac{1}{1 + e^{-XB}} \qquad (1)$$

with, correspondingly,

$$P(\Pi_2) = \frac{1}{1 + e^{XB}} \qquad (2)$$

Then

$$\log \frac{P(\Pi_1)}{P(\Pi_2)} = \log P(\Pi_1) - \log\,[1 - P(\Pi_1)] = XB. \qquad (3)$$

Thus, the log of the ratio of probabilities is linear in the
independent variables $X$.

The probit model is

$$P(\Pi_1) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{XB} e^{-T^2/2}\, dT \qquad (4)$$

where $T$ is $N(0, 1)$. Clearly the probability increases
monotonically with $XB$.

To estimate the coefficients in (1) and (2), we can use a
MLE approach. If the observations $Y_1, \ldots, Y_n$ are ordered
so that the $n_1$ cases from $\Pi_1$ appear first, then the cases
from $\Pi_2$, we have:

$$P(Y_1, \ldots, Y_n) = \prod_{i=1}^{n_1} P_i \prod_{i=n_1+1}^{n} (1 - P_i) = \prod_{i=1}^{n} P_i^{Y_i} (1 - P_i)^{1 - Y_i} \qquad (5)$$

and

$$\log L = \sum_{i=1}^{n_1} Y_i \log P_i + \sum_{i=n_1+1}^{n} (1 - Y_i) \log (1 - P_i) \qquad (6)$$

Substituting (1) for $P_i$ in equation (6), taking partial
derivatives with respect to the $B_j$, $j = 1, \ldots, k$, and
setting them equal to zero, a set of $k$ equations results which
can be solved for the estimated coefficients $b_j$. In practice,
the equations are difficult to solve. A similar approach may be
followed for probit analysis.

We now discuss the logit/probit models in terms of the five
properties previously developed.
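Because the likelihood equations have no closed-form solution, they
are solved iteratively in practice. The following Python sketch is
one possible illustration, not the procedure used in this research:
it applies Newton-Raphson steps to a logit model with a constant and
a single regressor, on hypothetical data. The function names
`fit_logit` and `score` are our own.

```python
import math

def probabilities(b0, b1, xs):
    """P_i = 1 / (1 + exp(-(b0 + b1 * x_i))) for each case."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]

def score(b0, b1, xs, ys):
    """Partial derivatives of log L with respect to b0 and b1."""
    ps = probabilities(b0, b1, xs)
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
    return g0, g1

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson solution of the two likelihood equations."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = probabilities(b0, b1, xs)
        g0, g1 = score(b0, b1, xs, ys)
        w = [p * (1.0 - p) for p in ps]
        # Elements of the 2 x 2 information matrix X'WX.
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, xs))
        h11 = sum(wi * x * x for wi, x in zip(w, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: b <- b + (X'WX)^{-1} * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical sample; the two classes overlap, so the MLE is finite.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logit(xs, ys)
print(b0, b1, score(b0, b1, xs, ys))
```

At the solution both partial derivatives are zero to within rounding
error, which is the condition stated in the text.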

Property 1: Invariance with respect to reordering of categories.

Both models are monotonic in $XB$. As such, similar problems
occur with respect to the ordering of categories with cases like
the example of 3.1, i.e., (0, 0), (1, 1) are members of $\Pi_1$
and (0, 1), (1, 0) are members of $\Pi_2$. The problem arises
again because of the application of multiplication and addition
functions to qualitative variables where, of course, such
functions have no meaning.


Property 2: Flexibility with respect to objective functions.

With the MLE approach, the intent is to develop the best
estimates of $Y$ given the functional form assumed. This
objective function is not among those discussed in Chapter II.
Certainly, it would not be possible to maximize $V$, say.
However, it is possible, as with the LDF and regression
analysis, to set the cut-off probability in order to achieve a
desired value of $P^*$. That is, let $c$ be the probability value
such that at most $nP^*$ of the $\hat{P}_i$ are below it and at
least $n(1 - P^*)$ are above it. Then
$D^* = \langle D_1^*, D_2^* \rangle$ is the decision rule with

$$D_1^* = \{x \mid \hat{P} \ge c\}$$

where $\hat{P} = 1/(1 + e^{-xb})$ is the estimated probability
associated with $x$.

As discussed before, this is not necessarily an optimum
procedure since $D_1^*$ was not selected as best among all
possible decision rules with $P \le P^*$.


Property 3: Amenable to practical use.

Just as with the LDF and the regression analysis approach,
the probabilities are difficult to compute by hand and the form
of the probability equation (1) is difficult for the average
user to understand or appreciate.

Property 4: Applicable for a wide range of sample sizes.

The sample sizes needed for logit/probit analysis are similar
to those required for the LDF or regression analysis approach
since the same number of coefficients must be estimated from the
sample data. Thus, for most problems with 10 or more variables,
sample sizes should be in excess of 500.


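Property 5: Ability to handle interaction effects

Consider the derivative of (1) with respect to the variable
$X_i$:

$$\begin{aligned}
\frac{\partial P(\Pi_1)}{\partial X_i}
  &= \frac{\partial}{\partial X_i}\left(\frac{1}{1 + e^{-XB}}\right)
   = -\frac{1}{(1 + e^{-XB})^2}\,\frac{\partial}{\partial X_i}\left(e^{-XB}\right) \\
  &= \frac{B_i\, e^{-XB}}{(1 + e^{-XB})^2}
   = B_i \cdot \frac{1}{1 + e^{-XB}} \cdot \frac{e^{-XB}}{1 + e^{-XB}} \\
  &= B_i\, P(\Pi_1)\,\bigl(1 - P(\Pi_1)\bigr) \qquad (7)
\end{aligned}$$

since

$$\frac{e^{-XB}}{1 + e^{-XB}} = \frac{e^{-XB} e^{XB}}{e^{XB} + 1} = \frac{1}{1 + e^{XB}} = 1 - P(\Pi_1).$$

Expression (7) shows that the logit model handles interactions
since the value of the derivative is a function of $B_i$ and
$P$, which is itself a function of all the independent variables.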
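Expression (7) can be verified numerically. The Python sketch below
is a check under assumed values (the coefficient vector `b` and the
observation vector `x` are hypothetical): it compares a central
finite-difference derivative of the logistic probability with
$B_i P(\Pi_1)(1 - P(\Pi_1))$.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_pi1(x, b):
    """P(Pi_1) under the logit model for one observation vector x."""
    return logistic(sum(xi * bi for xi, bi in zip(x, b)))

# Hypothetical coefficients and observation vector (first element is the constant).
b = [-1.0, 0.8, -0.5]
x = [1.0, 2.0, 1.0]

i = 1      # differentiate with respect to the second element of x
h = 1e-6   # finite-difference step

x_plus = list(x)
x_plus[i] += h
x_minus = list(x)
x_minus[i] -= h

numeric = (p_pi1(x_plus, b) - p_pi1(x_minus, b)) / (2 * h)
p = p_pi1(x, b)
analytic = b[i] * p * (1 - p)   # expression (7)

print(numeric, analytic)
```

Because $P$ depends on every element of $x$, changing any other
variable changes the derivative with respect to $X_i$, which is the
sense in which the model reflects interactions.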


Similarly, for the probit function,

$$\frac{\partial P}{\partial X_i} = \frac{\partial}{\partial X_i}\left[\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{XB} e^{-T^2/2}\, dT\right] = B_i\, \phi(XB)$$

where $\phi(XB)$ is the value of the standard normal density at
the point $XB$.

In sum, logit and probit analysis offer some advantages over
the LDF and regression analysis for the type of problem under
consideration. Nonetheless, they assume an underlying model form
which may or may not be appropriate in a particular instance.
This is, therefore, an undesirable feature if alternative
procedures are available which do not involve such an assumption.

3.4 The Multinomial Model

If a random sample of size $n$ is selected from
$\Pi = \langle \Pi_1, \Pi_2 \rangle$, then the estimated
probability that $X = x$ in population $\Pi_1$ is

$$P(X = x \mid \Pi_1) = \frac{n_1(x)}{n_1} \qquad (1)$$

where $n_1(x)$ is the number of sample observations in $\Pi_1$
with the vector $x$.

Similarly

$$P(X = x \mid \Pi_2) = \frac{n_2(x)}{n_2} \qquad (2)$$

Thus

$$\begin{aligned}
P(\Pi_1 \mid X = x) &= \frac{P(\Pi_1 \text{ and } X = x)}{P(X = x)}
  = \frac{P(\Pi_1)\, P(X = x \mid \Pi_1)}{P(X = x)} \\
 &= \frac{\dfrac{n_1}{n} \cdot \dfrac{n_1(x)}{n_1}}{\dfrac{n_1(x) + n_2(x)}{n}}
  = \frac{n_1(x)}{n_1(x) + n_2(x)} \qquad (3)
\end{aligned}$$

$$P(\Pi_2 \mid X = x) = \frac{n_2(x)}{n_1(x) + n_2(x)} \qquad (4)$$

The intuitive decision rule based on (3) and (4) is to
classify the case into $\Pi_1$ if

$$P(\Pi_1 \mid X = x) > P(\Pi_2 \mid X = x),$$

into $\Pi_2$ otherwise (randomly assigned if equality holds).
That is,

$$D = \langle D_1, D_2 \rangle$$

is such that

$$D_1 = \{x \mid n_1(x) > n_2(x)\}, \qquad D_2 = \{x \mid n_1(x) < n_2(x)\}.$$

While this "full multinomial" approach has some appealing
features, particularly the simplicity of the decision rule, it
has one overwhelming disadvantage: for a problem with $k$
variables each taking on $s_i$ values, there are

$$m = \prod_{i=1}^{k} s_i$$

possible observation vectors, e.g., for $k = 5$ and $s_i = 5$
for all $i$, $m = 5^5 = 3125$. For a typical sample, many of the
states will not appear at all, some will appear in $\Pi_1$ but
not $\Pi_2$ and vice versa, while others will appear so
infrequently as to make the estimates highly unreliable. If a
particular realization $x$ appears once in $\Pi_1$ and never in
$\Pi_2$, $P(\Pi_1 \mid X = x) = 1$. The same probability would
result if $x$ appeared 50 times in $\Pi_1$ and never in $\Pi_2$,
yet our intuitive faith in the probability estimate would be much
higher for the latter finding. If the user is faced with the
task of allocating a new case with an observation vector $x$ not
found in the sample, he has no mechanism for making the choice.
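The full multinomial rule and its weaknesses can be made concrete
with a short Python sketch. The sample below is hypothetical and the
function names are ours; the rule itself is the count comparison
$n_1(x) > n_2(x)$ given above.

```python
from collections import Counter

def fit_full_multinomial(sample):
    """sample: iterable of (x, cls) with x a tuple of category codes and cls in {1, 2}."""
    n1, n2 = Counter(), Counter()
    for x, cls in sample:
        (n1 if cls == 1 else n2)[x] += 1
    return n1, n2

def classify(x, n1, n2):
    """Return 1 or 2 by majority count, 'tie' on equality, None if x was never observed."""
    if n1[x] + n2[x] == 0:
        return None    # no mechanism for allocating an unseen vector
    if n1[x] == n2[x]:
        return "tie"   # such cases are assigned at random in the text
    return 1 if n1[x] > n2[x] else 2

# Hypothetical sample with two three-level variables.
sample = [((0, 1), 1)] + [((2, 2), 1)] * 50 + [((1, 1), 2)] * 3 + [((1, 1), 1)]
n1, n2 = fit_full_multinomial(sample)

# (0, 1) is seen once and (2, 2) fifty times, yet both give an estimate of 1.
for x in [(0, 1), (2, 2)]:
    print(x, classify(x, n1, n2), n1[x] / (n1[x] + n2[x]))

print(classify((1, 1), n1, n2), classify((0, 0), n1, n2))
```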

For these reasons, the full multinomial approach is
considered unacceptable for most applications. The aim, instead,
is to modify the approach to reduce the state space. For
example, Hills (1967) defines a "nearest neighbour" rule for the
case where the variables are all dichotomous. For a given
observation vector $x$ he defines a set of vectors that differ
from $x$ in no more than $r$ positions, i.e.,

$$T_j = \{\, y_j \mid (x - y_j)'(x - y_j) \le r \,\}.$$

Then the rule for $x$ is based on the vectors $y_j$. That is,
the decision rule is as follows:

$$x \in \Pi_1 \quad \text{if} \quad \sum_{T_j} n_1(y_j) > \sum_{T_j} n_2(y_j)$$

$$x \in \Pi_2 \quad \text{if} \quad \sum_{T_j} n_1(y_j) < \sum_{T_j} n_2(y_j)$$

(randomly allocate if equality holds).
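A neighbourhood rule of this kind can be sketched as follows. The
Python code is an illustration only: the sample is hypothetical, and
the comparison of raw neighbour counts is our reading of the rule
rather than a verified transcription of Hills (1967). For 0/1
vectors, $(x - y)'(x - y)$ is the number of positions in which $x$
and $y$ differ.

```python
def hamming(x, y):
    """Number of positions in which two 0/1 vectors differ; equals (x - y)'(x - y)."""
    return sum(a != b for a, b in zip(x, y))

def neighbour_rule(x, sample, r):
    """Classify x from the counts of sample cases within r positions of x."""
    near_1 = sum(1 for y, cls in sample if cls == 1 and hamming(x, y) <= r)
    near_2 = sum(1 for y, cls in sample if cls == 2 and hamming(x, y) <= r)
    if near_1 == near_2:
        return "tie"   # allocated at random in the text
    return 1 if near_1 > near_2 else 2

# Hypothetical sample of dichotomous observation vectors with class labels.
sample = [
    ((1, 1, 0), 1), ((1, 0, 1), 1), ((1, 0, 0), 1),
    ((0, 0, 1), 2), ((0, 1, 1), 2), ((0, 0, 0), 2),
]

# The vector (1, 1, 1) does not occur in the sample.
print(neighbour_rule((1, 1, 1), sample, r=0))
print(neighbour_rule((1, 1, 1), sample, r=1))
```

With $r = 0$ the rule reduces to the full multinomial rule and cannot
separate the classes for an unseen vector; with $r = 1$ the
neighbouring vectors supply the counts.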


Several other models have been developed for the case of
dichotomous variables, such as the Bahadur model in which,
for $X = (X_1, \ldots, X_m)$,

$$p_{ij} = E_i(X_j), \quad i = 1, 2$$

$$Z_{ij} = \frac{X_j - p_{ij}}{\sqrt{p_{ij}(1 - p_{ij})}}$$

$$\rho_i(jk) = E(Z_{ij} Z_{ik})$$

$$\vdots$$

$$\rho_i(12 \ldots m) = E(Z_{i1} Z_{i2} \cdots Z_{im})$$

Then

$$P(x \mid \Pi_i) = \prod_{j=1}^{m} p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j} \Bigl[\, 1 + \sum_{j<k} \rho_i(jk)\, Z_{ij} Z_{ik} + \sum_{j<k<l} \rho_i(jkl)\, Z_{ij} Z_{ik} Z_{il} + \cdots + \rho_i(12 \ldots m)\, Z_{i1} Z_{i2} \cdots Z_{im} \Bigr] \qquad (7)$$

Then, the classification procedure is the likelihood ratio

$$\frac{P(x \mid \Pi_1)}{P(x \mid \Pi_2)}$$

with estimates of all the above parameters substituted where
appropriate.

By assuming higher order correlations are zero, the
expression (7) can be reduced significantly.


Another approach, developed by Martin and Bradley (1972),
involves a description of the density of $x$ in terms of
orthogonal polynomials. Matsuita (1954, 55, 57) develops
classification rules based on placing an observation into the
class such that the estimated distributional distance is
maximized.


The  rules  discussed  in  this  section  are  briefly  reviewed 
below  in  terms  of  the  properties  introduced  in  Section  3.1. 


Property  1 :  Invariance  w.r.t.  categories  of  variables 


For  those  procedures  assuming  dichotomous  variables,  this 
property  is  not  relevant.  However,  the  assumption  of  dichotomous 
variables  is  rather  restrictive  for  most  practical  applications. 



Property  2 :  Flexibility  w.r.t.  objective  functions 

The  procedures,  with  the  exception  of  the  ones  based  on 
distributional  distance,  are  optimum  under  the  condition  that  the 
assumed  model  holds — since  the  rules  are  likelihood  ratio  based. 
However,  the  procedures  are  not  amenable  to  other  objective 
functions  or  to  control  of  the  screening  parameters  P  and  Q.  It 
appears  that  estimation  of  an  assumed  underlying  distribution 
takes  precedence  over  accurate  prediction  of  class  membership. 

Property  3:  Amenable  to  practical  use 

The full multinomial approach is simplest to apply but, as
has been pointed out, there may be many instances when a new
case cannot be classified because the observation vector $x$ did
not occur in the sample. The other procedures are all rather
complicated, involving, in the Bahadur model for example,
computing expressions for $P(x \mid \Pi_i)$, $i = 1, 2$, with a
large number of parameter estimates (the $p_{ij}$ and
$\rho_i(jk)$).

Property  4 :  Applicable  for  a  wide  range  of  sample  sizes 

The  multinomial  approach  clearly  requires  very  large  sample 
sizes  in  order  to  ensure  sufficient  observations  for  every  state. 
Similarly,  the  other  procedures  require  large  sample  sizes  to 
ensure  reliable  estimates  of  the  many  parameters.  Small  sample 
sizes  are  adequate  only  in  the  instances  when  there  are  very  few 
variables  taking  on  few  values  and  where  no  observation  patterns 
are  rare. 

Property  5:  Handles  interactions  among  variables 

The  approaches  do  handle  interactions  in  the  model  form  but 
the  parameter  space  tends  to  become  unwieldy.  This  often  forces 
users  to  make  a  priori  decisions  concerning  the  order  of  inter¬ 
action  terms  to  include. 

3.5 Automatic Interaction Detection (AID)

In the Preface to their book Searching for Structure,
Sonquist, Baker and Morgan say that they developed the AID
technique "...in rebellion against the restrictive assumptions of
conventional multivariate techniques...". Thus, AID represents
an initial step towards an approach that is appropriate for
qualitative independent variables.

The AID approach involves a repeated one-way analysis of
variance technique to explain as much variance in the dependent
variable as possible. Although the approach was designed for a
continuous dependent variable, the authors state that a
dichotomous dependent variable may be used if (in our notation)
$0.2 \le E \le 0.8$. Thus, AID can be used for statistical
screening under these circumstances.

The error variance, with no knowledge of the independent
variables, is, as we have seen,

$$\sum (Y - \bar{Y})^2 = \sum Y^2 - N\bar{Y}^2 = E(1 - E) \qquad (1)$$

For any two groups formed by the data (i.e., a decision rule
$D = \langle D_1, D_2 \rangle$), the error variance is

$$\sum Y_1^2 - N_1 \bar{Y}_1^2 + \sum Y_2^2 - N_2 \bar{Y}_2^2$$

$$= \sum Y^2 - N_1 \bar{Y}_1^2 - N_2 \bar{Y}_2^2$$

$$= E + P - 2PQ \quad \text{as shown in Chapter II.} \qquad (2)$$

The net reduction in variance is (1) - (2), or

$$2PQ - P - E^2. \qquad (3)$$

The AID algorithm searches through the $k$ variables
$X_1, X_2, \ldots, X_k$ and, for each variable, examines the
possible splits of the data using the categories of that
variable. The variable for which (3) is maximized for specified
categories is chosen for entry. This procedure is then applied
to each "split" of the population so generated. No further
splitting occurs if:

• the marginal reduction in variance is below
a pre-specified threshold;

• one of the new groups would have a total
number of cases below a specific threshold;

• the total number of splits has exceeded a
pre-specified amount.
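One step of such a search can be sketched in Python as follows. This
is a simplified illustration, not the AID program itself: it assumes
that only "one category versus the rest" splits are examined, it uses
the usual between-group sum of squares as the splitting criterion
rather than expression (3), and the data and the names `best_split`
and `between_group_ss` are hypothetical.

```python
def between_group_ss(ys_a, ys_b):
    """Reduction in the sum of squares of Y obtained by splitting into two groups."""
    n_a, n_b = len(ys_a), len(ys_b)
    if n_a == 0 or n_b == 0:
        return 0.0
    mean_all = (sum(ys_a) + sum(ys_b)) / (n_a + n_b)
    mean_a, mean_b = sum(ys_a) / n_a, sum(ys_b) / n_b
    return n_a * (mean_a - mean_all) ** 2 + n_b * (mean_b - mean_all) ** 2

def best_split(records, variables, min_group_size=1):
    """records: list of (dict of variable -> category, y). Returns (gain, variable, category)."""
    best = (0.0, None, None)
    for var in variables:
        for cat in sorted({x[var] for x, _ in records}):
            inside = [y for x, y in records if x[var] == cat]
            outside = [y for x, y in records if x[var] != cat]
            # Stopping rule: do not form groups below the size threshold.
            if min(len(inside), len(outside)) < min_group_size:
                continue
            gain = between_group_ss(inside, outside)
            if gain > best[0]:
                best = (gain, var, cat)
    return best

# Hypothetical sample: y = 1 marks membership in Pi_1.
records = [
    ({"region": "n", "status": "a"}, 1), ({"region": "s", "status": "a"}, 1),
    ({"region": "n", "status": "b"}, 0), ({"region": "s", "status": "b"}, 0),
    ({"region": "n", "status": "c"}, 0), ({"region": "s", "status": "c"}, 1),
]

gain, variable, category = best_split(records, ["region", "status"])
print(gain, variable, category)
```

The full algorithm would apply `best_split` again to each of the two
groups so formed, stopping when the gain, the group size, or the
number of splits reaches its threshold.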

Bishop et al. (1975) state that the AID process has the
drawback that it does not take into account the sampling
variability in the data. They cite a study by Einhorn (1972)
which demonstrated that the AID algorithm consistently created
apparent structures where none existed in reality. That is, the
AID algorithm fits the data well but not necessarily the
underlying situation.

We  now  discuss  the  AID  algorithm  in  terms  of  the  properties: 



Property  1:  Unaffected  by  random  reordering  of  categories 


Because  the  algorithm  examines  single  characteristics  at  a 
time,  it  is  unaffected  by  random  reordering  of  categories.  Thus, 
the  algorithm  satisfies  property  1. 

Property  2:  Flexible  w.r.t.  objective  functions 

The  approach  is  designed  to  maximize  explained  variance, 
although  it  may  be  possible  to  consider  the  same  sequential 
splitting  process  with  a  different  splitting  function.  However, 
there  is  little  control  over  the  parameters  P  and  Q.  Thus,  the 
AID  algorithm,  in  its  current  form  at  least,  does  not  satisfy 
property  2. 


Property 3: Useful output

The usefulness of the output depends upon the number of
groups formed by the data, and the number of variable values in
each group. If good results can be obtained while constraining
the number of groups and group membership, the output can be
meaningful. Otherwise, the results may represent merely a good
description of the structure of the data. Nonetheless, the form
of the output, linking characteristics by the logical and
operator, is more meaningful than other techniques with output
in the form of mathematical operations on qualitative variables.
From this viewpoint, then, the approach satisfies property 3.

Property  4 :  Applicable  for  wide  range  of  sample  sizes 

The  AID  algorithm  works  with  both  large  and  small  sample 
sizes.  However,  the  search  type  algorithm  requires  a  large 
number  of  calculations  to  choose  among  alternative  splits  at 
each  stage.  Thus,  it  tends  to  be  more  effective  with  small 
sample  sizes. 


Property 5: Ability to handle interaction effects

Because the algorithm considers all categories of variables
at each stage, it explicitly handles interaction effects.
However, it does not handle all possible interaction effects
because of the sequential nature of the process.

In sum, the AID algorithm comes closest to meeting the
properties desired of an approach for screening with qualitative
variables. However, it is somewhat restrictive. Nonetheless,
it demonstrates the power of advances in data processing
capabilities to provide more in-depth analysis of data. The aim,
then, is to develop more generalized procedures for analyzing
qualitative data for screening populations. This is the subject
of the last two chapters of this dissertation.

3.6 Summary

This Chapter provided an overview of the major competing
statistical screening techniques. In particular, we discussed
the applicability of each technique for practical problems
involving variables measured at the nominal level. In general,
the techniques failed because they were designed for higher
level variables.

The criticism of the techniques led to the definition of
five properties that should be sought after in development of a
new technique for qualitative variables:

• Property 1: The statistical algorithm should
not be affected by random reordering of the
categories of each variable.

• Property 2: The technique should be flexible
with respect to the objective function and
constraints that can be handled.

• Property 3: The form of the final output of
the approach should be meaningful for and
amenable to practical use in screening new
cases.

• Property 4: The statistical screening technique
should be applicable for a wide range
of sample sizes.

• Property 5: The screening technique should
handle interaction effects among variables.

Of the techniques reviewed, the AID algorithm came closest
to satisfying the above properties because it was developed
specifically for qualitative independent variables. The aim of
our research is to go beyond the AID algorithm to develop a new
class of procedures that have wider applicability, which have
greater power, and which more closely satisfy the above
properties. Chapter IV describes our initial efforts to develop
these procedures.




IV. THE  MONTE  CARLO  APPROACH 


In this Chapter, we describe a new approach to screening
based on Monte Carlo simulation. Results of an actual
application of the methodology are also presented. To our
knowledge, the Monte Carlo approach as developed here has not
been used before for purposes of statistical screening.

4.1 Background

In Chapter III, we briefly reviewed the general types of
screening techniques in common use. This review demonstrated
that, for practical screening problems with qualitative variables
and relatively small sample sizes, all of the approaches had
some serious weaknesses. The Monte Carlo approach described here
presents a first attempt to develop a new method that would be
appropriate to this type of problem.

Thus, the approach taken was to start from the basic
concepts of screening to develop a straightforward, easily
understood approach that would compare favorably with more
established techniques and, under certain conditions, outperform
those techniques. The first step, then, was to consider trial
and error selection.


4.2 Trial and Error Profile Selection

We introduce here the term "profile" to describe the set of
characteristics associated with cases from $D_1$. That is, a new
case fits the "profile" associated with a rule
$D = \langle D_1, D_2 \rangle$ if the vector of characteristics,
$x$, is a member of $D_1$.

An intuitive approach to developing a good rule $D$ might be
based on trial and error selection of possible variables and
variable values. One such approach is described below, focusing
on the parameter estimates $p_{ij}$ and $q_{ij}$ for each
variable value:

1. Examine the ($p_{ij}$, $q_{ij}$) combination for each variable
value $x_{ij}$, $i = 1, \ldots, k$, $j = 1, \ldots, s_i$.

2. Identify variable values with relatively high values
of $q_{ij}$. Since these variable values generally have a
low value of $p_{ij}$, combine these variable values with
the or operator.

3. Identify other variables with relatively high values
of $p_{ij} \cdot q_{ij}$. These variables can then be combined
with the variables in 2. with an and operator.



This process is illustrated below with an actual example.
Note that the initial aim was to develop high values of p·q
rather than V, for example, since this work preceded the
development of more complex objective functions.

Table 4.1 shows the variables that were used (some
preliminary work was done to develop this reduced list) ranked in
order of individual q values.


TABLE 4.1

    Variable #       q         p
         1         .7500     .0264
         2         .5405     .0488
         3         .4651     .0567
         4         .4595     .0488
         5         .4565     .0607
         6         .4474     .0501
         7         .3968     .1662
         8         .3875     .1055
         9         .3818     .0726
        10         .3654     .0686
        11         .3377     .1016
        12         .3367     .1293
        13         .3294     .1121
        14         .3290     .2045
        15         .3141     .2519
        16         .3095     .2216
        17         .3076     .5488

The first trial profile was as follows: (1 or 2 or 3 or 4
or 5 or 6) and 17. The reason for this construction was that the
first six variables had q values of 0.4474 or above, with little
difference among variables 3 through 6. Variable 7 was
noticeably lower at 0.3968. Then, it was noted that the q value
for variable 17 was not much lower than the others while
$p_{17}$ was by far the highest (alternatively,
$p_{17} \cdot q_{17}$ was a maximum). Thus, variable 17 was
included in an and combination.

This  profile  had  a  value  for  p  of  0.1121  and  of  q  =  0.5176, 
compared  to  the  random  rate  e  =  0.232. 
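The calculation of p and q for a trial profile of this form can be
sketched in Python. The sketch rests on an assumption taken from the
way the quantities are used in this section: p is the proportion of
sample cases fitting the profile, and q is the proportion of those
cases that belong to $\Pi_1$. The cases below are hypothetical and
are not the data behind the tables.

```python
def evaluate_profile(cases, or_variables, and_variable=None):
    """cases: list of (flags, in_pi1); flags maps variable number -> True/False.

    The profile is (v1 or v2 or ...) optionally combined with `and_variable`.
    Returns (p, q): share of cases fitting the profile, and share of those in Pi_1.
    """
    fits = []
    for flags, in_pi1 in cases:
        hit = any(flags[v] for v in or_variables)
        if and_variable is not None:
            hit = hit and flags[and_variable]
        if hit:
            fits.append(in_pi1)
    p = len(fits) / len(cases)
    q = sum(fits) / len(fits) if fits else 0.0
    return p, q

# Hypothetical cases with three indicator variables (not the data of Table 4.1).
cases = [
    ({1: True, 2: False, 17: True}, True),
    ({1: False, 2: True, 17: False}, True),
    ({1: False, 2: True, 17: True}, False),
    ({1: False, 2: False, 17: True}, False),
    ({1: False, 2: False, 17: False}, False),
]

# Profile "(1 or 2) and 17".
p, q = evaluate_profile(cases, or_variables=[1, 2], and_variable=17)
print(p, q, p * q)
```

Dropping the and term, as in the later trial profiles, raises p at
the cost of admitting cases that do not have the and characteristic.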

Because p was lower than desired, the next three trial
profiles involved adding variables 7, 8 and 9 in turn in an or
combination. The fifth profile was just the first six variables
connected with or. The sixth through eighth profiles
successively added variables 7, 8 and 9 in an or combination.
Variable 17 was not included in any of these last four profiles,
in order to achieve a higher q.


Results for these eight trial profiles are given in Table
4.2:

TABLE 4.2

    Profile #   Profile Description                                     p       q      p·q
        1       (1 or 2 or 3 or 4 or 5 or 6) and 17                   .1121   .5176   .0580
        2       (1 or 2 or 3 or 4 or 5 or 6 or 7) and 17              .1741   .4242   .0738
        3       (1 or 2 or 3 or 4 or 5 or 6 or 7 or 8) and 17         .1834   .4101   .0752
        4       (1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9) and 17    .2058   .4103   .0844
        5       1 or 2 or 3 or 4 or 5 or 6                            .1926   .4521   .0810
        6       1 or 2 or 3 or 4 or 5 or 6 or 7                       .2995   .3700   .1108
        7       1 or 2 or 3 or 4 or 5 or 6 or 7 or 8                  .3193   .3554   .1134
        8       1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9             .3575   .3506   .1253

Exhibit 4.1 plots the results in terms of the profile
diagram of Chapter II. The circled numbers represent the
profiles; the other numbers are the original variables. The
exhibit shows that the trial profiles generally outperform the
single variables.


A definition of inadmissibility is as follows: a profile
(p, q) is inadmissible if there exists a profile (p*, q*) such
that $p^* \ge p$ and $q^* \ge q$ and either $p^* > p$ or
$q^* > q$. That is, the profile (p*, q*) dominates (p, q). In
this context, variables 3, 4, 5, 6, 8, 9, 10 and 11 are dominated
by trial profile (1). Variables 12 and 13 are dominated by
profile (2). Variables 14, 15 and 16 are dominated by profile
(6). Also, profiles (2) and (3) are dominated by profile (5).
The potentially admissible profiles (or variables) are variables
1, 2 and 17 and profiles (1), (5), (4), (6), (7) and (8). Note
that there may exist profiles not yet discovered by the analysis
which dominate the above profiles.
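The admissibility check is mechanical and can be reproduced from the
tabulated values. The Python sketch below applies the definition of
dominance to the (p, q) pairs transcribed from Tables 4.1 and 4.2;
the labels "v" (variable) and "P" (trial profile) are our own.

```python
# (p, q) pairs from Tables 4.1 and 4.2; "v" marks a variable, "P" a trial profile.
points = {
    "v1": (.0264, .7500), "v2": (.0488, .5405), "v3": (.0567, .4651),
    "v4": (.0488, .4595), "v5": (.0607, .4565), "v6": (.0501, .4474),
    "v7": (.1662, .3968), "v8": (.1055, .3875), "v9": (.0726, .3818),
    "v10": (.0686, .3654), "v11": (.1016, .3377), "v12": (.1293, .3367),
    "v13": (.1121, .3294), "v14": (.2045, .3290), "v15": (.2519, .3141),
    "v16": (.2216, .3095), "v17": (.5488, .3076),
    "P1": (.1121, .5176), "P2": (.1741, .4242), "P3": (.1834, .4101),
    "P4": (.2058, .4103), "P5": (.1926, .4521), "P6": (.2995, .3700),
    "P7": (.3193, .3554), "P8": (.3575, .3506),
}

def dominates(a, b):
    """True if a = (p*, q*) dominates b = (p, q) in the sense defined in the text."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

# A point is admissible if no other point dominates it.
admissible = sorted(
    name for name, pt in points.items()
    if not any(dominates(other, pt) for other in points.values())
)

dominated_by_p1 = sorted(
    name for name, pt in points.items() if dominates(points["P1"], pt)
)

print(admissible)
print(dominated_by_p1)
```

The admissible set obtained this way consists of variables 1, 2 and
17 together with trial profiles 1, 4, 5, 6, 7 and 8.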


Exhibit 4.1 also illustrates the tendency for profiles to
cluster. For example, profiles (2), (3), (4) and (5) are fairly
similar, as are (6), (7) and (8). Variables 3, 4, 5 and 6 are
alike also. This is important in practice for it provides the
user some flexibility in choosing the final solution.

Overall,  however,  the  gain  in  going  from  single  variables  to 
multi-variable  profiles  appears  to  be  small  in  this  application. 
This  suggests  that  the  trial  and  error  approach  is  rather  poor 
as  a  guide  to  combining  variables.  The  reason,  of  course,  is 
that  the  approach  does  not  use  any  information  covering  the 
interrelationships  among  the  variables. 


There  are  two  general  directions  one  can  take  to  improve 
this  situation: 


1)  Generate  a  much  larger  set  of  profiles  from  which  the 
best  may  be  selected. 

2)  Develop  techniques  that  make  use  of  information  on  the 
relationship  between  two  variables. 

In the remainder of this Chapter, we pursue the first direction.
Chapter V introduces new techniques of the second type.


4.3 Monte Carlo Approach

The  Monte  Carlo  approach  is,  in  effect,  a  trial  and  error 
approach  without  human  intervention.  The  computer  is  instructed 
to  pick  variable  values  at  random  and  combine  them  in  a  random 
fashion  to  form  a  profile.  On  repeated  trials,  the  hope  is  that 
a  profile  will  be  found  which  is  sufficiently  close  to  optimal 
for  practical  purposes. 


Theoretically, the process is as follows: randomly select the
initial variable value, x_ij say, from all the possible variable
values of the k variables. Randomly select an operator from
the set (and, or, not). Then, randomly select a variable from
the remaining set and continue until a random number of variable
values have been included. For each profile so generated, compute
p, q, p.q and ∇ (or any other objective function of
interest). Select the profile that maximizes the objective
function.

Note  that  this  procedure  does  not  make  use  of  any  informa¬ 
tion  about  the  interrelationship  of  variables.  Instead  it 
relies  on  the  power  of  repeated  trials  to  locate  the  inter¬ 
relationships  that  have  strong  discriminatory  power. 
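
The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the study: the data are synthetic, the function and variable names are hypothetical, the maximum number of terms and the 25 percent chance of negating a term are arbitrary choices, and p.q is used as the objective function.

```python
import random


def evaluate(mask, in_pi1):
    """Return (p, q, p*q) for a profile indicator over the sample."""
    hits = [y for fits, y in zip(mask, in_pi1) if fits]
    if not hits:
        return 0.0, 0.0, 0.0
    p = len(hits) / len(mask)   # share of cases fitting the profile
    q = sum(hits) / len(hits)   # share of fitting cases that belong to pi_1
    return p, q, p * q


def random_profile(indicators, rng, max_terms=4):
    """Combine randomly chosen variable values with random and/or/not."""
    n_terms = rng.randint(1, min(max_terms, len(indicators)))
    names = rng.sample(sorted(indicators), n_terms)
    mask, text = None, ""
    for name in names:
        negate = rng.random() < 0.25          # arbitrary chance of "not"
        term = [v != negate for v in indicators[name]]
        label = ("not " if negate else "") + name
        if mask is None:
            mask, text = term, label
            continue
        op = rng.choice(("and", "or"))
        if op == "and":
            mask = [a and b for a, b in zip(mask, term)]
        else:
            mask = [a or b for a, b in zip(mask, term)]
        text = f"({text}) {op} {label}"
    return mask, text


def monte_carlo(indicators, in_pi1, trials=200, seed=0):
    """Keep the randomly generated profile with the largest p*q."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        mask, text = random_profile(indicators, rng)
        score = evaluate(mask, in_pi1)
        if best is None or score[2] > best[0][2]:
            best = (score, text)
    return best


# Synthetic sample: 60 cases, five binary variable values x1..x5.
demo_rng = random.Random(1)
in_pi1 = [demo_rng.random() < 0.3 for _ in range(60)]
indicators = {
    f"x{i}": [demo_rng.random() < 0.4 for _ in range(60)] for i in range(1, 6)
}
(best_p, best_q, best_pq), best_text = monte_carlo(indicators, in_pi1)
print(best_text, best_p, best_q, best_pq)
```

Any other objective function of (p, q, e) could be substituted in `evaluate` without changing the search loop.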

Example 

This  example  is  again  taken  from  the  study  of 
Medicaid  cases.  The  actual  procedure  followed  differed 
somewhat  from  the  theoretical  description  above. 

Variable 17 was pre-specified to be in the profile
because it had such a high value of p. Also variable
1 was included because it had a particularly high value
of q. That is, the form was specified to be as
follows:

17 and (1 or V_1 or V_2 or ... or V_m), where
V_1 to V_m were to be picked at random from
the other 15 variables x_1 to x_15. 7 to 10
variables were to be added for each trial,
with the actual number selected being a
random variable (P(7) = P(8) = P(9) =
P(10) = 1/4).

Clearly,  this  represents  a  rather  constrained  example  of 
the  Monte  Carlo  approach.  By  constraining  the  solution,  the 
number  of  very  poor  profiles  generated  is  reduced  but  there  is 
a  corresponding  loss  of  power. 
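
As a rough sketch of this constrained form, the generator below draws the number of added variables uniformly from 7 to 10 and then samples that many variables without replacement. The function name is hypothetical, and labeling the pool of 15 remaining variables as 2 through 16 is an assumption made for illustration.

```python
import random


def constrained_profile(rng, pool=range(2, 17), sizes=(7, 8, 9, 10)):
    """Return one profile of the form 17 and (1 or V_1 or ... or V_m)."""
    m = rng.choice(sizes)                      # each size equally likely
    added = sorted(rng.sample(list(pool), m))  # m distinct variables
    return {"and": 17, "or": [1] + added}


rng = random.Random(0)
generated = [constrained_profile(rng) for _ in range(200)]
print(generated[0])
```

Each generated profile would then be scored on the sample (for example by p.q) and the best ones retained.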

Table  4.3  shows  the  results  for  the  nine  profiles  judged 
best  from  the  200  profiles  generated  in  this  fashion. 


TABLE 4.3


Monte Carlo
Profile Number   Profile Description                      p       q       p.q

     [1]         17 and (1,4,5,6,9,11,12,14,15,16)        .3391   .3658   .1240
     [2]         17 and (1,2,6,8,10,11,12,14,15)          .3061   .3707   .1135
     [3]         17 and (1,3,4,6,9,11,13,15,16)           .2995   .3833   .1148
     [4]         17 and (1,2,3,4,9,10,12,13,16)           .2982   .3850   .1148
     [5]         17 and (1,3,5,6,7,9,10,11,13,16)         .2810   .4038   .1135
     [6]         17 and (1,2,4,6,8,9,12,13,16)            .2757   .4067   .1121
     [7]         17 and (1,3,4,5,8,9,10,11,12,16)         .2704   .4000   .1082
     [8]         17 and (1,2,3,4,9,10,12,16)              .2573   .4256   .1095
     [9]         17 and (1,2,4,5,9,10,12,16)              .2559   .4222   .1052

To show the improvement achieved with the random selection,
we compare the Monte Carlo profiles with the trial and error
profiles:

•  profile © is dominated by profile a

•  profiles ③, ④ and ② are dominated by profile [8]

•  profile © is dominated by profile a

More importantly, perhaps, no trial and error profile dominates
a Monte Carlo profile. Exhibit 4.2 shows the Monte Carlo
profiles ([4] and [7] are not shown since the results are so
similar to [3] and [6] respectively). Some of the trial and error
profiles are shown for comparative purposes.

Of  course,  these  results  are  not  unexpected  since  these  are 
the  best  9  out  of  200  Monte  Carlo  trials,  whereas  there  were 
only  8  trial  and  error  profiles.  Nonetheless,  it  does  demon¬ 
strate  the  general  assumption  underlying  the  Monte  Carlo  approach: 
the  greater  the  number  of  trials,  the  greater  the  chance  of 
finding  a  good  profile. 

In  examining  Table  4.3,  alternative  ways  of  improving  the 
results  may  be  considered.  For  example,  since  the  9  profiles 
shown  there  are  the  "best"  (at  least  in  terms  of  p.q),  we  could 
identify  common  characteristics  of  these  profiles.  Table  4.4 
displays  the  results  by  variable  for  the  9  profiles. 

TABLE  4.4 


The results show that variables 9 and 16 appear in 8 out of
the 9 profiles and always together. Variable 7, on the other
hand, appears only once, and variable 14 only twice. This
suggests that new profiles could be generated using a smaller set
of variables. For example, restricting the selection to only
those variables appearing 6 or more times, we obtain the reduced
set [1, 4, 9, 10, 12, 16].
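
The tally behind this reduced set can be reproduced from the profile descriptions in Table 4.3. The sketch below is illustrative only; it assumes the profile lists as read from that table (variable 17, common to every profile, is left out), and the threshold of 6 is the one quoted above.

```python
from collections import Counter

# Variables inside the parentheses of the nine profiles in Table 4.3.
mc_profiles = [
    [1, 4, 5, 6, 9, 11, 12, 14, 15, 16],
    [1, 2, 6, 8, 10, 11, 12, 14, 15],
    [1, 3, 4, 6, 9, 11, 13, 15, 16],
    [1, 2, 3, 4, 9, 10, 12, 13, 16],
    [1, 3, 5, 6, 7, 9, 10, 11, 13, 16],
    [1, 2, 4, 6, 8, 9, 12, 13, 16],
    [1, 3, 4, 5, 8, 9, 10, 11, 12, 16],
    [1, 2, 3, 4, 9, 10, 12, 16],
    [1, 2, 4, 5, 9, 10, 12, 16],
]

# Number of profiles in which each variable appears.
counts = Counter(v for profile in mc_profiles for v in profile)

# Variables appearing in 6 or more of the 9 profiles.
reduced_set = sorted(v for v, c in counts.items() if c >= 6)

# Variables 9 and 16 occur in exactly the same profiles.
together = all((9 in prof) == (16 in prof) for prof in mc_profiles)

print(dict(sorted(counts.items())))
print(reduced_set, together)
```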

One  of  the  advantages  of  the  Monte  Carlo  approach  is  the 
information  provided  by  the  profiles  randomly  generated.  It  is 
much  easier  to  examine  the  best  profiles  developed  to  determine 
clues  as  to  effective  combinations  of  variables.  In  other  words, 
the  Monte  Carlo  approach  can  be  linked  to  an  approach  involving 
human  intervention,  taking  advantage  of  the  computer's  power  to 
identify  promising  profiles. 

4.4  Evaluation 


The  Monte  Carlo  approach  is,  of  course,  Ex  Post.  Thus, 
there  is  always  the  danger  that  the  "best"  Monte  Carlo  profile 
merely  fits  the  sample  data  rather  than  reality.  Two  methods  may 
be  adopted  to  test  the  results: 

1)  Develop  the  profile(s)  on  a  subset  of  the  data  base, 
then  test  them  on  the  remainder  of  the  data  base. 

2)  Develop  the  profile(s)  on  one  data  base  and  then  test 
them  on  another  sample  from  the  same  population. 

In  this  section,  we  discuss  an  empirical  test  of  the  Monte 
Carlo  approach  based  on  the  second  method.  Again  returning  to 
the  New  Hampshire  Medicaid  example,  profiles  were  developed  based 
on  a  sample  taken  from  four  District  Offices.  These  profiles 
were  then  applied  to  a  later  sample  of  Medicaid  cases  drawn  from 
the  same  four  offices.  Using  these  data,  we  compare  the  actual 
versus  predicted  performance. 

Table  4.5  shows  the  values  of  p,q  and  7  (denoted  p^ ,  q„,  7  ) 
developed  from  the  original  sample  and  resulting  values  obtained 
when  the  profiles  were  applied  to  a  new  sample  of  cases.  Sepa¬ 
rate  profiles  were  used  for  the  four  District  Offices  for  each 
of  two  population  types:  AI  =  Adult  Independent  and  NH  =  Nursing 
Home.  The  value  of  e  was  assumed  to  have  remained  the  same. 




Test  Results 


Manchester  NH  369 
AFDC  137 


Concord 


Berlin 


Conway 


NH  194 
AFDC  64 


AFDC  50 


AFDC  28 


139  j  .800 
179  j  .600 


Comparison  of  P  Values 

Treating  p  as  a  sample  proportion,  it  is  possible  to  test 
the  hypothesis: 

    H_0: P_A = P_B = P   vs.   H_a: P_A < P_B.

That is, reject H_0 at the 5% level if

$$p_A < p_B - 1.645 \sqrt{p(1-p)\left(\frac{1}{n} + \frac{1}{m}\right)}$$

where p is the pooled estimate of P, and n and m are the respective
sample sizes.

Applying this test to the sample data, H_0 could not be rejected
for any of the eight tests. In other words, there is no
evidence to support the contention that the proportion of cases
fitting the profile is less than expected.
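
A minimal sketch of this one-sided test is given below. The function name and the numerical inputs are hypothetical; the pooled estimate is taken to be the sample-size-weighted average of the two proportions, which is the usual choice but is an assumption here since the report does not spell it out.

```python
from math import sqrt


def reject_h0(p_a, n, p_b, m, z=1.645):
    """One-sided test of H0: P_A = P_B against Ha: P_A < P_B at the 5% level."""
    pooled = (n * p_a + m * p_b) / (n + m)     # assumed pooled estimate of P
    std_err = sqrt(pooled * (1 - pooled) * (1 / n + 1 / m))
    return p_a < p_b - z * std_err


# Hypothetical proportions and sample sizes, for illustration only.
print(reject_h0(0.30, 200, 0.32, 150))
print(reject_h0(0.10, 400, 0.30, 400))
```

The same function applies to the comparison of q values if the two sample sizes are replaced by the numbers of cases fitting the profile in each sample.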

Comparison  of  Q  Values 

Treating  Q  as  a  binomial  proportion  within  the  reduced  set 
of  cases  fitting  the  profile,  we  can  perform  a  similar  test  of 
the  hypothesis: 


    H_0: Q_A = Q_B = Q   vs.   H_a: Q_A < Q_B.

Reject H_0 at the 5% level if

$$q_A < q_B - 1.645 \sqrt{q(1-q)\left(\frac{1}{n^*_A} + \frac{1}{n^*_B}\right)}$$

where q is the pooled estimate of Q, and n*_A, n*_B are the number
of sample cases fitting the profile in each sample.

Again, H_0 could not be rejected for any of the eight tests.
However,  for  two  of  the  tests  the  effective  sample  sizes  were 
too  low  for  a  reasonable  test.  Nonetheless,  5  of  the  8  q  values 
were  above  the  corresponding  qR  values.  Thus,  the  profile  per¬ 
formed  about  as  expected  with  respect  to  the  parameter  q. 

Comparison of ∇ Values

Although the profiles were not developed to maximize ∇, it
is instructive to compare the realized values of ∇ in the two
samples.

First, it is interesting to note that ∇ appears to be inversely
correlated with the value of e. That is, the lower the
observed value of e, the greater the absolute value of ∇. This
is intuitively reasonable since there is greater potential to
increase q relative to e when e is small. The value of ∇ is,
of course, sensitive to the difference q-e.

Second, the observed values of ∇ in the new sample are greater
than their counterparts from the original sample in 5 of the 8
comparisons. This is a surprising finding: we would expect the
original estimates to be optimistically biased, given that the
profiles were developed ex post on relatively small sample sizes.

Because the sample sizes for each office in the original
sample were not available, it was not possible to test the
hypothesis that the two values of ∇ are equal. Furthermore,
since the value of e was assumed to be the same in each sample,
the tests would not be accurate.

Comparison  with  Other  Results 

Because  other  techniques  have  been  used  for  the  same  problem 
of  detecting  ineligible  Medicaid  cases,  it  is  instructive  to 
compare  their  performance  with  the  results  here. 

For  example,  South  Carolina  and  the  District  of  Columbia 
have  used  discriminant  analysis  with  the  results  as  shown  in 
Table  4.6. 



TABLE 4.6


                                p       q       e       ∇

South Carolina                 0.2     0.38    0.19    0.242
District of Columbia (1)       0.2     0.18    0.13    0.072
District of Columbia (2)       0.2     0.45    0.24    0.244
District of Columbia (3)       0.2     0.14    0.07    0.116


The three sets of results for D.C. refer to three different types
of eligibility error. The results shown are based upon the
sample used to develop the discriminant function.
Therefore, the actual results may be different. To compare these
results  with  the  Monte  Carlo  results,  we  examine  instances  where 
the  values  of  e  are  similar: 

•  For values of e of 0.176 and 0.125, the Monte
Carlo results (as shown in Table 4.5) were
values of ∇_B of 0.464 and 0.607 respectively;
whereas, for discriminant analysis, values of
e of 0.13 and 0.07 were associated with ∇ values
of only 0.072 and 0.116 respectively.

•  For values of e of 0.355 and 0.303, the Monte
Carlo approach achieved values of 0.383 and 0.470;
whereas, for discriminant analysis, values of
e of 0.19 and 0.24 had ∇ values of 0.242 and
0.244.

Because  these  results  were  based  on  samples  from  different 
States,  definitive  conclusions  cannot  be  made.  Nonetheless,  it 
appears  that  the  Monte  Carlo  approach  outperforms  the  discrimi¬ 
nant  function  approach. 

The  Social  Security  Administration  uses  the  AID  algorithm 
to  help  identify  ineligible  Supplemental  Security  Income  cases. 
They  achieved  the  following  results: 

p = 0.11
q = 0.57
e = 0.2
∇ = 0.33


The value of ∇ is greater than the values obtained by the
discriminant analysis approach for similar values of e. This is
not  unexpected  since,  as  we  pointed  out  in  Chapter  III,  the  AID 
algorithm  came  closest  to  satisfying  the  desired  properties  of 
a  screening  procedure.  Furthermore,  the  Monte  Carlo  approach 
appears  to  outperform  the  AID  algorithm  indicating  that  it  may 
be  a  viable  alternative.  In  Chapter  V,  we  introduce  further 
refinements  to  the  Monte  Carlo  approach  which  perform  even  better 
than  the  figures  presented  here. 

We should also point out that the Monte Carlo profiles were
not selected to maximize ∇. Thus, we would expect that, if
maximization of ∇ had been the specified objective function, the
resulting ∇ values would have been higher.

These  comparisons  are  not  rigorous.  A  more  rigorous  com¬ 
parison  would  require  the  application  of  each  technique  to  the 
same  data  base.  Because  of  the  limited  resources  for  program¬ 
ming,  we  decided  to  defer  testing  of  the  algorithms  until  the 
new  methods  presented  in  Chapter  V  could  be  programmed.  As  will 
be  seen,  the  programming  task  is  quite  extensive. 

4.5 Summary


In  this  Chapter,  we  have  presented  a  new  type  of  statistical 
screening  approach  that  takes  advantage  of  the  power  of  the  com¬ 
puter  to  identify  effective  combinations  of  variables.  The 
approach  involves  no  assumptions  about  the  underlying  distri¬ 
bution  of  variables.  It  classifies  every  observation  into  one 
of  the  two  populations.  Despite  the  fact  that  the  methodology 
was  developed  from  an  intuitive  basis,  without  any  closed-form 
analysis  to  demonstrate  the  "optimality"  of  the  approach,  the 
empirical  results  suggest  that  the  approach  is  highly  effective. 

In  particular  we  have  shown  that  the  sample-based  results  were 
duplicated  in  an  actual  test  of  the  profiles  on  a  new  sample. 

The  approach  also  satisfies  the  properties  introduced  in 
Chapter  III,  namely: 

•  Property  1 :  The  methodology,  by  construction, 
is  clearly  unaffected  by  changes  in  the  order¬ 
ing  of  categories  of  each  variable  since  the 
approach  randomly  picks  variable  values  from 
the  available  ones. 

•  Property 2: The approach can handle any objective
function that is based on sample
statistics since it automatically calculates
the objective function for each random profile
developed. Constraints can be handled easily
by choosing the best profile among those
satisfying the constraint.

•  Property 3: The results are easy to use and
interpret. The solution is in the form of a
sequence of variables linked by "and" or "or".
The user need only check the characteristics
of each new case against the logical expression.
No mathematical computations are required.

•  Property  4:  The  approach  can  handle  any  sample 
sizes,  although  the  advantages  of  the  technique 
over  other  techniques  such  as  the  LDF  tend  to 
diminish  as  sample  size  increases.  For  example, 
the  LDF's  assumption  of  multivariate  normality 
becomes  less  problematic.  For  small  sample 
sizes,  the  technique  has  clear  advantages  over 
approaches  based  on  the  multinomial  model  which 
suffers  from  state  sparseness. 

•  Property  5:  The  approach  handles  interactions 
explicitly  since  it  seeks  out  effective  com¬ 
binations  of  variable  values. 

Despite  the  strengths  of  the  Monte  Carlo  approach,  we 
believe  it  barely  scratches  the  surface  of  what  appears  to  be  a 
new  direction  for  screening  problems  with  qualitative  variables. 

In  Chapter  V,  we  develop  the  initial  concepts  embodied  by  the 
Monte  Carlo  approach  into  a  much  wider  class  of  procedures. 


V. NEW SCREENING PROCEDURES


5.1  Introduction 


In  Chapter  II,  we  provided  some  background  to  the  screening 
problem,  including  an  overview  of  the  various  objective  functions 
that  could  be  relevant.  In  Chapter  III,  we  reviewed  some  of  the 
established  screening  techniques  and  demonstrated  that  none  of 
these  techniques  were  entirely  satisfactory  under  certain  con¬ 
ditions.  We  also  suggested  five  properties  that  a  screening 
technique  should  have.  In  Chapter  IV,  we  presented  an  initial 
effort  to  apply  a  new  technique  that  was  designed  specifically 
for  the  problem  under  consideration.  Results  of  that  application 
were  also  presented.  In  this  chapter,  we  build  upon  the  work  of 
those  chapters  in  order  to  propose  a  new  group  of  screening 
procedures.  First,  we  consider  sequential  algorithms  for  general 
and specific objective functions. Then, we develop procedures
with a pre-specified form of output.

5.2 The General Sequential Algorithm

As  discussed  in  Chapter  III,  it  is  important  that  a  screening 
algorithm  be  amenable  to  different  objective  functions.  Also,  in 
Chapter  II,  we  showed  that  most  objective  functions  of  interest 
are  well-behaved  functions  of  the  parameters  P,  Q  and  E.  Thus, 
we  can  assume  a  general  objective  function  that  is  to  be  maxi¬ 
mized  (without  loss  of  generality): 

$$\theta = f(P, Q, E). \qquad (1)$$

Since we are dealing with sample data, the aim of the
algorithm is to find the maximum of the sample value of θ, that
is to find

$$\theta^* = f(p^*, q^*, e) = \max_{D \in \mathcal{D}} f(p, q, e). \qquad (2)$$

For a given sample, e is fixed. Thus, the maximization takes
place over the (p, q) combinations associated with each possible
decision rule D in $\mathcal{D}$.

As usual, assume we observe a sample of size n from
$\Pi = \langle \pi_1, \pi_2 \rangle$, with each observation described
by k variables $X_1, \ldots, X_k$, where each variable $X_i$ may take
on $s_i$ values, $x_{i1}, \ldots, x_{is_i}$.

Then, the General Sequential Algorithm can be described as
follows:

1)  Choose the variable value $x_{ij}$ or its complement to
maximize θ. That is, assume $x_{m,n} = x^{(1)}$ is such that

$$\theta^{(1)}_{x_{m,n}} = \max_{\{x_{ij},\ \text{not } x_{ij}\}} \theta_{x_{ij}}. \qquad (3)$$

That is, $\theta^{(1)}_{x_{m,n}}$ is the maximum value of θ when
evaluated for all $2\sum_{i=1}^{k} s_i$ values that can occur. The
decision rule $D^{(1)} = \langle D_1^{(1)}, D_2^{(1)} \rangle$ is

$$D_1^{(1)} = \{x \mid x_m = x_{m,n}\}, \qquad D_2^{(1)} = \{x \mid x_m \ne x_{m,n}\}.$$

2)  The next variable to enter is that variable value $x_{ef}$
which maximizes θ over all possible combinations: $x_{m,n}$ and
$x_{ef}$; $x_{m,n}$ or $x_{ef}$; $x_{m,n}$ and not $x_{ef}$; $x_{m,n}$
or not $x_{ef}$. Define the resulting composite variable as
$x^{(2)}$; for example, if the maximum occurs for $x_{m,n}$ and not
$x_{ef}$, then

$$x^{(2)} = \begin{cases} 1 & \text{if } x_m = x_{m,n} \text{ and } x_e \ne x_{ef} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

and $D^{(2)} = \langle D_1^{(2)}, D_2^{(2)} \rangle$ with

$$D_1^{(2)} = \{x \mid x^{(2)} = 1\}, \qquad D_2^{(2)} = \{x \mid x^{(2)} = 0\}.$$

3)  The procedure continues in this fashion until a stopping
rule halts the process at stage s. Some possible
stopping rules are discussed later.
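
The three steps above can be sketched as a greedy search. This is a minimal illustration rather than a definitive implementation: the function and variable names are hypothetical, the data are a toy sample, the objective shown is the probability of correct classification 1 - e - p(1 - 2q), and the only stopping rules used are "no improvement" and a maximum number of stages.

```python
def sequential_search(records, in_pi1, objective, max_stages=5):
    """Add one variable value per stage via and / or / and-not / or-not."""
    n = len(records)
    e = sum(in_pi1) / n                        # sample share of pi_1
    values = sorted({(var, val) for rec in records for var, val in rec.items()})
    indicator = {v: [rec.get(v[0]) == v[1] for rec in records] for v in values}

    def theta(mask):
        fitting = sum(mask)
        if fitting == 0:
            return float("-inf")               # empty profiles are never chosen
        p = fitting / n
        q = sum(y for y, fits in zip(in_pi1, mask) if fits) / fitting
        return objective(p, q, e)

    def candidates(composite):
        """Yield (mask, step) for every extension of the current composite."""
        for value in values:
            for negate in (False, True):
                term = [x != negate for x in indicator[value]]
                if composite is None:          # stage 1: value or complement
                    yield term, (None, negate, value)
                else:                          # later stages: four combinations
                    both = [a and b for a, b in zip(composite, term)]
                    either = [a or b for a, b in zip(composite, term)]
                    yield both, ("and", negate, value)
                    yield either, ("or", negate, value)

    composite, rule, best_theta = None, [], float("-inf")
    for _ in range(max_stages):
        mask, step = max(candidates(composite), key=lambda c: theta(c[0]))
        if theta(mask) <= best_theta:          # stop: no further improvement
            break
        composite, best_theta = mask, theta(mask)
        rule.append(step)
    return best_theta, rule, composite


# Toy sample: pi_1 consists exactly of the cases with a = 1 and b = 1.
records = [{"a": a, "b": b} for a in (1, 0) for b in (1, 0) for _ in range(2)]
in_pi1 = [rec["a"] == 1 and rec["b"] == 1 for rec in records]


def p_correct(p, q, e):
    return 1 - e - p * (1 - 2 * q)


best_theta, rule, composite = sequential_search(records, in_pi1, p_correct)
print(best_theta, rule)
```

Each entry of `rule` is (operator, negated?, (variable, value)), so the profile can be read off directly as a sequence of variable values linked by logical operators.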


5.2.1  Features  of  the  General  Sequential  Algorithm 


Although  sequential  algorithms  are  not  new  (for  example, 
the  AID  algorithm  is  sequential),  the  General  Sequential  Algorithm 
presented  here  has  certain  interesting  features: 


•  The algorithm searches for "good" (in terms of the
pre-specified objective function) variable values
rather than variables. This is a desirable feature
since we are dealing with qualitative variables.


•  The algorithm is powerful in that it allows the
maximization to take place over all four possible
logical combinations. The AID algorithm, for
example, considers only the "and" combination.


•  The  procedure  makes  no  distributional  assumptions. 


•  The  procedure  automatically  considers  combinations 
of  values,  although  in  a  sequential  fashion. 


•  The  procedure  is  not  too  sensitive  to  the  first 
variable  selected  since  it  allows  the  solution  to 
shift  away  from  the  original  variable  value  via 
the  or  operator. 


5.2.2  Properties  of  the  General  Sequential  Algorithm 


More specifically, the algorithm satisfies the properties
of Chapter III:

Property 1: The procedure is clearly not dependent on the order
of categories of each variable since, at each stage, all possible
variable values are considered.

Property 2: By construction, the procedure handles any of the
objective functions discussed in Chapter II, including those with
constraints (see 5.6).

Property  3:  The  form  of  the  output  is  a  sequence  of  character¬ 
istics  linked  by  logical  operators.  To  allocate  a  new  case,  the 
user  merely  matches  the  case's  characteristics  to  this  "profile." 
No  mathematical  computations  are  required  once  the  profile  has 
been  constructed. 

Property  4:  The  procedure  can  be  used  with  any  sample  size  but, 
unlike  most  other  techniques,  it  may  be  most  effective  with  small 
sample  sizes  because  of  the  computational  effort  involved  in  the 
maximization  at  each  stage. 

Property  5:  The  procedure  handles  interaction  effects  by  seeking 
effective  combinations  of  values.  However,  the  sequential 
approach  significantly  reduces  the  interactions  to  be  studied. 

5.2.3  Computational  considerations 

Despite the flexibility inherent in the General Sequential
Algorithm, the computational burden does not appear too
great. At any stage h, the objective function $\theta^{(h)}$ must be
computed for at most $4\sum_{i=1}^{k} s_i$ logical combinations
(at most, since some combinations may not be feasible,
e.g., $x_{ij}$ and not $x_{ij}$). As we have seen, the objective function
is generally a simple function of P, Q, E (p, q and e).


However, from a programming standpoint the problem is
more difficult. Consider just the parameters P and Q. Let $x^{(h)}$
be the composite variable defined at stage h. Now we examine all
variable values to maximize $\theta^{(h+1)}$ over all possible combinations.
For variable value $x_{ij}$ and combination "or", we have

$$p = P(x^{(h)} \text{ or } x_{ij}) = P(x^{(h)}) + P(x_{ij}) - P(x^{(h)} \cap x_{ij}) \qquad (1)$$

$P(x^{(h)})$ is known from computations at stage h while $P(x_{ij})$ is
known from the original data. For $P(x^{(h)} \cap x_{ij})$, the computer
must scan through all cases and count those cases for which the
composite variable $x^{(h)} = 1$ and $x_{ij} = 1$.


Similarly,

$$q = P(\pi_1 \mid x^{(h)} \text{ or } x_{ij}) = \frac{P(\pi_1 \text{ and } [x^{(h)} \text{ or } x_{ij}])}{P(x^{(h)} \text{ or } x_{ij})} \qquad (2)$$


The denominator of (2) is available from (1). The numerator is
found by counting those cases from $\pi_1$ which have characteristics
$x^{(h)}$ or $x_{ij}$.

In effect, the sample is partitioned at each stage into
those cases for which the composite variable $x^{(h)} = 1$ and those
where it is zero. Thus, the computer must partition the data
base and provide counts for combinations of interest. The
programming for this effort turns out to be very complex.
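
The counting involved in the "or" combination can be sketched as follows. The function name and the six-case sample are hypothetical; the sketch computes p by inclusion and exclusion and q as the share of fitting cases that come from the first population.

```python
def or_combination(x_h, x_ij, in_pi1):
    """Return (p, q) for the profile 'x_h or x_ij' over the sample."""
    n = len(x_h)
    p_h = sum(x_h) / n                                   # known from stage h
    p_ij = sum(x_ij) / n                                 # known from the data
    p_both = sum(a and b for a, b in zip(x_h, x_ij)) / n  # requires a scan
    p = p_h + p_ij - p_both                              # inclusion-exclusion
    # Cases from pi_1 that fit the combined profile.
    numerator = sum(y and (a or b) for a, b, y in zip(x_h, x_ij, in_pi1)) / n
    q = numerator / p if p > 0 else 0.0
    return p, q


# Hypothetical six-case sample.
x_h = [True, True, False, False, False, False]
x_ij = [False, True, True, False, False, False]
in_pi1 = [True, False, True, False, True, False]
p, q = or_combination(x_h, x_ij, in_pi1)
print(p, q)
```

The "and" combinations need only the joint count, so the same scan of the cases serves all four combinations.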


5.2.4  Stopping  rules 

In  general,  the  sequential  procedure  would  continue 
through  stage  s,  say,  where  s  is  determined  by  a  rule  such  as  the 
following: 

1)  $\theta^{(s)} \ge \theta^*$, where $\theta^*$ is a pre-specified target
value of the objective function.

2)  The maximum number of variable values to be allowed
in the profile is s.

3)  The marginal improvement offered by the next variable
is below a pre-specified value, c, say:

$$\frac{\theta^{(s)} - \theta^{(s-1)}}{\theta^{(s-1)}} < c.$$

A  number  of  variations  on  the  above  rules  are  possible. 
Thus,  the  user  may  specify  the  stopping  rules  appropriate  for  the 
problem  under  consideration. 
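
One way to combine such rules is sketched below. The function and parameter names are hypothetical, and the sketch assumes one variable value enters the profile per stage, so that the number of stages equals the number of variable values in the profile.

```python
def should_stop(thetas, target=None, max_values=None, min_rel_gain=None):
    """thetas[s-1] holds the objective value reached at stage s."""
    s = len(thetas)
    # Rule 1: the pre-specified target value has been reached.
    if target is not None and thetas[-1] >= target:
        return True
    # Rule 2: the profile already holds the maximum number of variable values.
    if max_values is not None and s >= max_values:
        return True
    # Rule 3: the relative improvement of the last stage is below c.
    if min_rel_gain is not None and s >= 2 and thetas[-2] != 0:
        if (thetas[-1] - thetas[-2]) / thetas[-2] < min_rel_gain:
            return True
    return False


print(should_stop([0.50, 0.80], target=0.75))
print(should_stop([0.50, 0.60], min_rel_gain=0.05))
```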

In  the  next  sections,  we  consider  special  cases  of  the 
General  Sequential  Algorithm. 

5.3 Sequential Algorithm for ∇


Recall that the proportionate reduction of error measure
was defined to be

$$\nabla = \frac{2P(Q-E)}{E + P(1-2E)} = f(P, Q, E).$$

For the sample, the objective function to be maximized is

$$\nabla = \frac{2p(q-e)}{e + p(1-2e)}.$$
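
As a quick numerical check of the sample formula 2p(q-e) / (e + p(1-2e)), the sketch below evaluates it for the four (p, q, e) triples reported for discriminant analysis in Table 4.6. The function name is hypothetical.

```python
def pre_measure(p, q, e):
    """Proportionate reduction of error: 2p(q - e) / (e + p(1 - 2e))."""
    return 2 * p * (q - e) / (e + p * (1 - 2 * e))


# (p, q, e) triples from Table 4.6.
table_4_6 = {
    "South Carolina": (0.2, 0.38, 0.19),
    "District of Columbia (1)": (0.2, 0.18, 0.13),
    "District of Columbia (2)": (0.2, 0.45, 0.24),
    "District of Columbia (3)": (0.2, 0.14, 0.07),
}

computed = {name: round(pre_measure(*pqe), 3) for name, pqe in table_4_6.items()}
print(computed)
```

The rounded results agree with the values listed in the last column of Table 4.6.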


To understand the sequential procedure, consider Table 5.1
below which shows the sample counts for each cell


TABLE 5.1

              x^(h) = 1               x^(h) = 0
         x_kc = 1   x_kc = 0     x_kc = 1   x_kc = 0

π_1         ①          ②            ③          ④
π_2         ⑤          ⑥            ⑦          ⑧

where x^(h) is the composite variable defined at stage h, and x_kc
is another variable value being considered for combination with
x^(h) at stage h+1.

According to the General Sequential Algorithm, four combinations
may be considered. For each combination, we count the
errors in terms of the error cells (cells where classification
errors are made):

        Combination                     Error Cells

(i)     x^(h) = 1 and x_kc = 1          ⑤ + ② + ③ + ④        (1)
(ii)    x^(h) = 1 and x_kc = 0          ⑥ + ① + ③ + ④        (2)
(iii)   x^(h) = 1 or  x_kc = 1          ⑤ + ⑥ + ⑦ + ④        (3)
(iv)    x^(h) = 1 or  x_kc = 0          ⑤ + ⑥ + ⑧ + ③        (4)

For each combination, there is a corresponding value of ∇.
That is, from (1),

$$\nabla_{(i)} = \frac{2(①+⑤)\left[\frac{①}{①+⑤} - e\right]}{e + (①+⑤)(1-2e)} \qquad (5)$$

$$\nabla_{(ii)} = \frac{2(②+⑥)\left[\frac{②}{②+⑥} - e\right]}{e + (②+⑥)(1-2e)} \qquad (6)$$

$$\nabla_{(iii)} = \frac{2(①+②+③+⑤+⑥+⑦)\left[\frac{①+②+③}{①+②+③+⑤+⑥+⑦} - e\right]}{e + (①+②+③+⑤+⑥+⑦)(1-2e)} \qquad (7)$$

$$\nabla_{(iv)} = \frac{2(①+②+④+⑤+⑥+⑧)\left[\frac{①+②+④}{①+②+④+⑤+⑥+⑧} - e\right]}{e + (①+②+④+⑤+⑥+⑧)(1-2e)} \qquad (8)$$

Thus, x_kc improves the prediction in terms of ∇ if

$$\max\left(\nabla_{(i)}, \nabla_{(ii)}, \nabla_{(iii)}, \nabla_{(iv)}\right) > \nabla^{(h)} = \frac{2(①+②+⑤+⑥)\left[\frac{①+②}{①+②+⑤+⑥} - e\right]}{e + (①+②+⑤+⑥)(1-2e)} \qquad (9)$$

Furthermore, $\nabla^{(h+1)} > \nabla^{(h)}$ as long as there exists at least
one $x_{ij}$ ($i = 1, \ldots, k$; $j = 1, \ldots, s_i$) for which inequality (9) holds.

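Expressions (5) through (8) can be written more formally in
terms of the sample counts $n_{ij}$ as follows:

$$\nabla_{(i)} = \frac{2n_{11} - 2n_{\cdot 1}n_{1\cdot}}{n_{1\cdot} + n_{\cdot 1} - 2n_{\cdot 1}n_{1\cdot}} \qquad (10)$$

$$\nabla_{(ii)} = \frac{2n_{12} - 2n_{\cdot 2}n_{1\cdot}}{n_{1\cdot} + n_{\cdot 2} - 2n_{\cdot 2}n_{1\cdot}} \qquad (11)$$

$$\nabla_{(iii)} = \frac{2(n_{1\cdot} - n_{14}) - 2(1 - n_{\cdot 4})(n_{1\cdot})}{n_{1\cdot} + (1 - n_{\cdot 4})(1 - 2n_{1\cdot})} \qquad (12)$$

$$= \frac{2(n_{1\cdot} - n_{14}) - 2(1 - n_{\cdot 4})(n_{1\cdot})}{1 + n_{1\cdot} - n_{\cdot 4} - 2(1 - n_{\cdot 4})(n_{1\cdot})} \qquad (13)$$

$$\nabla_{(iv)} = \frac{2(n_{1\cdot} - n_{13}) - 2(1 - n_{\cdot 3})(n_{1\cdot})}{1 + n_{1\cdot} - n_{\cdot 3} - 2(1 - n_{\cdot 3})(n_{1\cdot})} \qquad (14)$$

$$\nabla^{(h)} = \frac{2(n_{11} + n_{12}) - 2(n_{\cdot 1} + n_{\cdot 2})(n_{1\cdot})}{n_{1\cdot} + n_{\cdot 1} + n_{\cdot 2} - 2(n_{\cdot 1} + n_{\cdot 2})(n_{1\cdot})} \qquad (15)$$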

5.4 Minimize Probability of Misclassification

As shown in Chapter II, the probability of misclassification
can be written in terms of P, Q, E as

$$P(M) = E + P(1 - 2Q) \qquad (1)$$

Minimizing P(M) is equivalent to maximizing the probability
of correct classification,

$$P(C) = 1 - P(M) = 1 - E - P(1 - 2Q) = f(P, Q, E) = \theta_p \qquad (2)$$

To further explore the nature of the sequential algorithm,
we again refer to the 2x4 table:


              x^(h) = 1               x^(h) = 0
         x_kc = 1   x_kc = 0     x_kc = 1   x_kc = 0

π_1         ①          ②            ③          ④
π_2         ⑤          ⑥            ⑦          ⑧

Assume x^(h) is the composite variable defined at the hth stage
and x_kc is a new variable value being considered at the (h+1)th
stage. For $\theta_p^{(h+1)} < \theta_p^{(h)}$ we must have, for all possible variable
values x_kc, and for each of the four combinations, the sum of the
error cell counts greater than the sum for x^(h). Thus:


        Combinations                    Error Cell Counts

i)      x^(h) = 1 and x_kc = 1          ⑤ + ② + ③ + ④        (3)
ii)     x^(h) = 1 and x_kc = 0          ⑥ + ① + ③ + ④        (4)
iii)    x^(h) = 1 or  x_kc = 1          ⑤ + ⑥ + ⑦ + ④        (5)
iv)     x^(h) = 1 or  x_kc = 0          ⑤ + ⑥ + ⑧ + ③        (6)

Therefore, each of (3), (4), (5), (6) must be greater than the
error cell count for x^(h), which is ⑤ + ⑥ + ③ + ④. This
implies the following must hold simultaneously:

    ① > ⑤
    ② > ⑥        (7)
    ⑦ > ③
    ⑧ > ④

That is, each of the error cells created by x^(h) must be less
than the corresponding error cell in the other row. This is
illustrated by the following example:

              x^(h) = 1               x^(h) = 0
         x_kc = 1   x_kc = 0     x_kc = 1   x_kc = 0     Total

π_1        0.20       0.20         0.05       0.05        .5
π_2        0.10       0.10         0.10       0.20        .5

Total      0.30       0.30         0.15       0.25       1.0

Clearly, the conditions in (7) are satisfied, and

$$\theta_p^{(h)} = 1 - P^{(h)}(M) = 1 - (0.1 + 0.1 + 0.05 + 0.05) = 0.7$$

$$\theta^{(h+1)}_{p,x_{kc}} = \max\left(1 - P^{(h+1)}_{x_{kc}}(M)\right)$$
$$= 1 - \min\left((3), (4), (5), (6)\right)$$
$$= 1 - \min(.40, .40, .35, .45)$$
$$= 1 - 0.35 = 0.65.$$

Thus,

$$\theta^{(h+1)}_{p,x_{kc}} < \theta^{(h)}_{p} \qquad (8)$$

This demonstrates that θ is not always improved by the
addition of another variable value. However, for the algorithm
not to result in an improved $\theta^{(h+1)}$ at each stage, condition (8)
must hold for all possible variable values, i.e.,

$$\max_{\{x_{kc}\}} \theta^{(h+1)}_{p,x_{kc}} < \theta^{(h)}_{p} \qquad (9)$$

Given the restrictiveness of the conditions in (7) and the
number of variable values available at each stage, it is highly
unlikely that (9) would hold until the solution is near optimum.
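
The arithmetic of this example can be reproduced with a short sketch. The cell proportions are those of the example table above (cells 1 to 4 for the first population, 5 to 8 for the second, columns in the order shown); the dictionary keys and function name are hypothetical labels chosen for illustration.

```python
# Cell proportions of the example: cells 1-4 are pi_1, cells 5-8 are pi_2.
cells = {1: 0.20, 2: 0.20, 3: 0.05, 4: 0.05,
         5: 0.10, 6: 0.10, 7: 0.10, 8: 0.20}

# Columns (1..4) assigned to pi_1 under each rule.
rules = {
    "x(h)=1": {1, 2},
    "i) x(h)=1 and x_kc=1": {1},
    "ii) x(h)=1 and x_kc=0": {2},
    "iii) x(h)=1 or x_kc=1": {1, 2, 3},
    "iv) x(h)=1 or x_kc=0": {1, 2, 4},
}


def p_misclassified(columns):
    """Sum of error cells: pi_2 cases selected plus pi_1 cases not selected."""
    selected_pi2 = sum(cells[c + 4] for c in columns)
    missed_pi1 = sum(cells[c] for c in range(1, 5) if c not in columns)
    return round(selected_pi2 + missed_pi1, 10)


errors = {name: p_misclassified(cols) for name, cols in rules.items()}
theta_h = round(1 - errors["x(h)=1"], 10)
theta_next = round(
    1 - min(v for name, v in errors.items() if name != "x(h)=1"), 10
)
print(errors)
print(theta_h, theta_next)
```

The sketch reproduces the error counts .40, .40, .35 and .45 and the drop from 0.7 to 0.65.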

The error cell counts shown in (3), (4), (5) and (6) also
demonstrate that the effect of a new variable value is to shift
exactly one error cell from one row to another.  For example,
combination i) shifts the error cell count from ⑤ + ⑥ + ③ + ④
to ⑤ + ② + ③ + ④.  Hence the contribution of variable value
x_kc in combination i) is ⑥ − ② (if the result is negative, the
variable value does not help in this combination).  The marginal
contribution of x_kc at stage h+1 is, therefore,

     θ_{x_kc}^(h+1) − θ^(h) = max (⑥ − ②, ⑤ − ①, ③ − ⑦, ④ − ⑧).     (10)


This can be defined more formally by writing Table 5.1 in
terms of the joint probabilities P_abc, where

     a = 1 for π1          b = 1 if x^(h) = 1        c = 1 if x_kc = 1
       = 2 for π2            = 0 if x^(h) = 0          = 0 if x_kc = 0

with sample estimates p_abc.  Then,

     θ_{x_kc}^(h+1) − θ^(h) = max (p_210 − p_110, p_211 − p_111,
                                   p_101 − p_201, p_100 − p_200)     (11)

                            =  max   (p_21c − p_11c, p_10c − p_20c).  (12)
                              c=0,1

Thus, the sequential algorithm can be seen as a search, at
each stage, for the variable value x_kc with the largest marginal
contribution as given by (12).  Equivalently, the search is for
the variable value for which one of the four differences is a
maximum.

Recall from Chapter II that minimizing the probability of
misclassification is equivalent to maximizing R².  Thus, the
sequential algorithm for this situation is analogous to stepwise
selection in regression analysis where the objective is to
introduce the new variable that maximizes R², given the variables
already included.  The comparison is not exact, however, since in
our context we introduce variable values as a logical combination
with the previous composite indicator variable, rather than
introducing variables to form a new regression equation.

5.5  Maximize P.Q

As discussed in Chapter II, an objective function that is a
good proxy for more complicated functions is θ = f(P,Q,E) = P.Q.
That is, "good" profiles tend to have relatively high values of
P and Q.  Referring to Table 5.1 again, we see that

     θ^(h) = [(① + ② + ⑤ + ⑥) / n] · [(① + ②) / (① + ② + ⑤ + ⑥)]

           = (① + ②) / n.                                          (1)

The corresponding values of θ^(h+1) for the four combinations are

            Combination                        p.q

     i)     x^(h) = 1 and x_kc = 1             ① / n               (2)

     ii)    x^(h) = 1 and x_kc = 0             ② / n               (3)

     iii)   x^(h) = 1 or x_kc = 1              (① + ② + ③) / n     (4)

     iv)    x^(h) = 1 or x_kc = 0              (① + ② + ④) / n     (5)


From (4) and (5), θ^(h+1) > θ^(h) unless ③ = ④ = 0.  In
other words, the objective function p.q can never be decreased by
the addition of another variable in the or combination.  In fact,
the General Sequential Algorithm will force the ultimate solution
to be a sequence of variable values linked by the or operation.

Furthermore, we have seen that pq ≤ e.  Yet, the available and
trivial solution p = 1, q = e, is such that pq = e.  No other
solution could perform better in terms of this objective function.
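A small numerical sketch of this behaviour follows. It assumes, for illustration only, the π1 row of the example in Section 5.4 and uses the identity pq = P(selected and π1); the variable and function names are hypothetical.

```python
# pi1 joint probabilities keyed by (x_h, x_kc); illustrative values only.
pi1 = {(1, 1): 0.20, (1, 0): 0.20, (0, 1): 0.05, (0, 0): 0.05}
e = sum(pi1.values())  # P(pi1)

def pq(rule):
    """p*q = P(selected and pi1): the pi1 mass of the cells the rule selects."""
    return sum(mass for cell, mass in pi1.items() if rule(*cell))

stage_h = pq(lambda xh, xk: xh == 1)
and_kc1 = pq(lambda xh, xk: xh == 1 and xk == 1)
or_kc1 = pq(lambda xh, xk: xh == 1 or xk == 1)
or_kc0 = pq(lambda xh, xk: xh == 1 or xk == 0)
select_all = pq(lambda xh, xk: True)   # the trivial solution p = 1, q = e

print(stage_h, and_kc1, or_kc1, or_kc0, select_all, e)
```

In this sketch the or combinations never fall below the stage h value, the and combination never exceeds it, and selecting every case attains the upper bound e.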

For these reasons, pq is considered poor as an objective
function if the sequential algorithm is to be applied, although
the statistic pq is interesting since it reflects the expected
yield of cases from π1.  A more useful objective function is to
maximize pq subject to p ≤ P* where P* is a pre-specified value.

In the section that follows, we discuss the sequential algorithm
when there are constraints on the objective function.

5.6  Constrained Objective Functions

In many practical problems where the aim is to identify cases
in π1 so that they may be treated differently from cases in π2,
resource constraints may put a limit on the value of p.  In some
instances, the constraint may be even more severe.  For example,
if the staff available for cases in π1 is fixed, then the value of
p should not be too high (additional staff needed) or too low
(staff must be laid off).  Thus, there may be a target P = P*, say.

When P is fixed, most of the objective functions already
studied become equivalent.  That is, the aim is to maximize Q
subject to fixed P = P*.  In practice, we may wish to specify an
allowable range for the realized (sample) value of p, e.g.,

          P* − ε < p < P* + ε                                      (1)

where, for instance, we could set

          ε = Z_{α/2} · sqrt(P*(1 − P*) / n)                       (2)

where Z_{α/2} is the standard normal deviate associated with a
probability of α/2 in the tail of the distribution, or ε = cP*,
where c is a subjectively determined constant (e.g., c = 0.10).
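The two ways of setting the tolerance can be sketched as follows. This is only an illustration: the function names are hypothetical, and the first function assumes that (2) is the usual normal-approximation half-width for a sample proportion with sample size n.

```python
from math import sqrt
from statistics import NormalDist

def tolerance_normal(p_star, n, alpha=0.05):
    """Half-width z_{alpha/2} * sqrt(P*(1 - P*)/n); n is the sample size (assumed)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(p_star * (1 - p_star) / n)

def tolerance_relative(p_star, c=0.10):
    """Half-width set as a subjective fraction c of the target P*."""
    return c * p_star

eps_normal = tolerance_normal(0.2, 1000)
eps_relative = tolerance_relative(0.2)
print(round(eps_normal, 4), round(eps_relative, 4))
```

For a target of 0.2 and 1,000 cases, the two choices give tolerances of roughly 0.025 and 0.02, so in this setting they would lead to similar allowable ranges.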

The  Generalized  Sequential  Algorithm  could  be  applied  to  the 
constrained  objective  function  so  that,  at  each  stage,  the 
solution  fell  within  the  constraints.  However,  this  appears  to 
be  unduly  restrictive  since  it  is  only  the  last  stage  solution 
that needs to fall within the bounds set in (1).  Thus, it may be
appropriate to develop modifications to the General Sequential
Algorithm.

5.6.1  Maximize θ subject to P = P*

Algorithm 5.6.1.1

Assume that we aim to have P* − ε < p < P* + ε.

Step 1:  Select x_ij to maximize q_ij.  In practice, several x_ij
         may have a q_ij of 1.  In the case of ties, select the
         value with the highest value of p_ij.

Step 2:  a)  If p_ij < P* − ε, select x_kc to maximize

                  P(π1 | x_ij = 1 or x_kc = 1)  or
                  P(π1 | x_ij = 1 or x_kc = 0).

             Note that p^(2) ≥ p_ij.  That is, we are forcing the
             initial value of p_ij higher.

         b)  If p_ij > P* + ε, select x_kc to maximize

                  P(π1 | x_ij = 1 and x_kc = 1)  or
                  P(π1 | x_ij = 1 and x_kc = 0).

             In this instance, we are forcing the value of p
             lower.

         c)  If P* − ε < p_ij < P* + ε, stop.

Step 3:  Repeat step 2 starting with the composite variable x^(2).

Although  this  procedure  seems  reasonable,  it  has  a 
number  of  potentially  critical  drawbacks: 

1)  The procedures in step 2 do not ensure that the solution
    ever falls in the desired range for p.  In practice,
    step 2b) tends to force the value of p well below P* − ε
    because of the and operator.  Thus, the results may
    continuously oscillate above and below the desired range.

2)  The procedure reduces the power of the General Sequential
    Algorithm since only two of the four combinations
    may be considered at any stage.

3)  The procedure may reach the desired range of P* early in
    the process.  The solution will be feasible but possibly
    far from optimal.


The aim, therefore, is to develop another algorithm that
avoids these drawbacks.  Consider the following, for instance:

     •  Constrain solutions at each stage so that:

             |p^(h+1) − P*| ≤ |p^(h) − P*|.                        (1)

        That is, at each stage h the value of p^(h) must
        be closer to P*.

     •  Start with an unconstrained problem minimizing
        the probability of misclassification, or maximizing
        ∇, for example, using the Generalized Sequential
        Algorithm.  Then, at (pre-specified) stage s apply
        the constrained sequential algorithm using the
        constraint of (1).

This  algorithm  is  given  below. 

Algorithm 5.6.1.2

Step 1:  Select x_ij to maximize θ as in the Generalized Sequential
         Algorithm.

Step 2:  Select x_kc to maximize θ^(2), where the maximization is
         over all four possible combinations.

Step 3:  Continue until stage s is completed.  Then

         a)  If P* − ε ≤ p^(s) ≤ P* + ε, stop.                     (2)

         b)  If (2) does not hold, select x_ab to maximize
             θ^(s+1) over all possible combinations with x^(s)
             for which

                  |p^(s+1) − P*| < |p^(s) − P*|.

Step 4:  Continue step 3 until (2) holds.

This  algorithm  appears  to  retain  the  maximum  power  of 
the  General  Sequential  Algorithm  while  ultimately  forcing  the 
solution  to  the  desired  range.  The  only  danger  is  if  the 
solution  converges  slowly.  This  may  be  solved  by  setting  an 
upper  limit  on  the  number  of  stages,  s  +  t  say  (t  >  0),  then 
repeating  the  algorithm  with  a  different  starting  variable. 
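The steps of Algorithm 5.6.1.2 can be sketched roughly as below. This is a minimal illustration, not the report's program: the data layout (a dictionary of boolean indicator columns), the function names, and the use of 1 − P(misclassification) as θ are assumptions, and ties are broken by taking the first candidate found.

```python
OPS = {
    "and":     lambda a, b: a and b,
    "and not": lambda a, b: a and not b,
    "or":      lambda a, b: a or b,
    "or not":  lambda a, b: a or not b,
}

def coverage(rule):
    """p: share of cases the rule assigns to pi1."""
    return sum(rule) / len(rule)

def accuracy(rule, y):
    """theta = 1 - P(misclassification) for the rule 'assign to pi1 if True'."""
    return sum(r == t for r, t in zip(rule, y)) / len(y)

def delayed_constraint(X, y, p_star, eps, s, max_stages):
    first = max(X, key=lambda name: accuracy(X[name], y))   # stage 1
    rule, expr = list(X[first]), first
    for stage in range(2, max_stages + 1):
        p_now = coverage(rule)
        if stage > s and abs(p_now - p_star) <= eps:
            break                                  # constraint (2) holds
        best = None
        for name, col in X.items():
            for op_name, op in OPS.items():
                cand = [op(a, b) for a, b in zip(rule, col)]
                # After stage s, admit only moves that bring p closer to P*.
                if stage > s and abs(coverage(cand) - p_star) >= abs(p_now - p_star):
                    continue
                score = accuracy(cand, y)
                if best is None or score > best[0]:
                    best = (score, cand, f"({expr} {op_name} {name})")
        if best is None:
            break                                  # no admissible move remains
        _, rule, expr = best
    return expr, rule

T, F = True, False
y = [T, T, T, T, F, F, F, F]
X = {"a": [T, T, T, F, T, F, F, F], "b": [F, F, F, T, F, F, F, F]}
expr, rule = delayed_constraint(X, y, p_star=0.5, eps=0.05, s=2, max_stages=4)
print(expr, coverage(rule), accuracy(rule, y))
```

In this toy run the unconstrained stages reach "(a or b)" with p = 0.625, and the constrained stage then trades some accuracy to bring p back to the target of 0.5.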


We term the type of algorithm exemplified by algorithm
2 a "delayed constraint" algorithm in the sense that the
constraint(s) on the solution are not considered until the
optimization process has had a chance to work.  Another
modification is as follows:

Algorithm 5.6.1.3

1)  Apply  the  Generalized  Sequential  Algorithm  for  the 
first  s  stages. 

2)  After  stage  s,  continue  the  Generalized  Sequential 
Algorithm  until  a  solution  occurs  which  satisfies 
the  constraint,  or  until  stage  s  +  t,  whichever 
comes  first. 


3)  If  a  feasible  solution  has  not  occurred  by  stage  s  +  t, 
apply  the  constrained  algorithm  to  subsequent  stages. 

This algorithm appears to hold the greatest promise for
performing consistently well, since it provides the most
opportunity for the sequential approach to arrive at a good solution
before the constraints are applied.  At best, the unconstrained
solution may happen to satisfy the constraints.  At worst, the
algorithm may result in an expression that involves more variable
values than might be preferred.

Another type of constraint that is of interest is the
form of the output.  For the Generalized Sequential Algorithm
the output is of the form:

     (...(((x_1 o_2 x_2) o_3 x_3) o_4 x_4) ... o_h x_h)

where  x_i  denotes the variable value selected for entry
            at the i-th stage, and

       o_i  denotes the logical operator (and, and not, or,
            or not) selected at the i-th stage,

with decision rule D = <D1, D2> so that

       D1 = {x | x^(h) = 1}

       D2 = {x | x^(h) = 0}
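Evaluating output of this nested form for a single case amounts to a left-to-right fold, as the following sketch suggests. The list encoding of the rule and the variable-value labels ("age<30", "urban", "employed") are purely hypothetical.

```python
OPS = {
    "and":     lambda a, b: a and b,
    "and not": lambda a, b: a and not b,
    "or":      lambda a, b: a or b,
    "or not":  lambda a, b: a or not b,
}

def evaluate(rule, case):
    """rule: [x_1, (o_2, x_2), ..., (o_h, x_h)]; case: set of variable values present."""
    first, *rest = rule
    result = first in case
    for op_name, value in rest:          # apply o_i x_i to the running composite
        result = OPS[op_name](result, value in case)
    return result

# ((age<30 or urban) and not employed), with made-up labels
rule = ["age<30", ("or", "urban"), ("and not", "employed")]
in_d1 = evaluate(rule, {"urban"})
in_d2 = not evaluate(rule, {"age<30", "employed"})
print(in_d1, in_d2)
```

A case for which the function returns True falls in D1; otherwise it falls in D2.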

Note that this output format, while requiring no mathematical
computation, is not easy to understand at an intuitive
level because of the nesting of parentheses.  In order to be
responsive to Property 3, then, we consider algorithms that are
constrained to a pre-specified output form, as discussed in the
next chapter.

5.7  Pre-Specified Form of Output

We have already seen that using logical operators is preferable
to mathematical operators when dealing with qualitative
variables.  However, the output may still be complicated, as
discussed above.  Therefore, we consider here the development of
procedures under the "constraint" of a pre-specified form of
output.

5.7.1  Form 1:  V or W


Let the output have the form:

     (v_1 and v_2 and v_3 and ... v_s) or (w_1 and w_2 and ... and w_t),   (1)

where the v_i and w_j are indicator variables selected by the
algorithm, e.g.,

          v_i = 1 if x_kc = 0
              = 0 if x_kc = 1

is an indicator variable which takes the value 1 whenever x_kc does
not occur.  Then, the output of (1) implies that

     D1 = {x | (v_1 = 1 and v_2 = 1 and ... v_s = 1)
                or
               (w_1 = 1 and w_2 = 1 and ... and w_t = 1)}.          (2)

The expression in (1) is simplified notation for (2).
We can also rewrite (1) as

          F = V_s or W_t,                                          (3)

where     V_s = (v_1 and v_2 and ... and v_s)
          W_t = (w_1 and w_2 and ... and w_t).

Thus, the output is seen to be the union of two profiles
each of which is formed using the and operator.  To determine
whether a new case belongs to π1, the user needs only check
whether the case has all the characteristics of V or all the
characteristics of W.
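The check described above is simple enough to state in a few lines. The sketch below assumes each case is a dictionary of 0/1 indicator values; the indicator names and the profiles V and W are hypothetical.

```python
def matches(profile, case):
    """True when the case shows every characteristic in an 'and' profile."""
    return all(case.get(name, 0) == 1 for name in profile)

def assign(profiles, case):
    """Form 1: assign to pi1 if any component profile matches, else to pi2."""
    return "pi1" if any(matches(p, case) for p in profiles) else "pi2"

V = ["v1", "v2"]   # hypothetical profile V_2
W = ["w1"]         # hypothetical profile W_1
print(assign([V, W], {"v1": 1, "v2": 1}), assign([V, W], {"v1": 1, "w1": 0}))
```

Extending the rule to three or more profiles only requires a longer list of profiles.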

Clearly, the output form could be extended to include
three or more profiles, i.e.,

          F = V_s or W_t or Z_u or ...

     where  Z_u = (z_1 and z_2 and ... and z_u).

At this point, there are no constraints on the component
profiles V, W, Z.  For example, we could have v_i = w_j for some
(i,j).  That is, variables included in one component profile
could be repeated in another component.  The next question, then,
is how the profiles can be developed.

Algorithm 5.7.1.1

In this approach we use sample information to develop
the profile.  That is, rank each individual variable value (or
the absence of the variable value) with respect to the objective
function of interest.  Then, relabel each of the variable values
in order (from best to worst) as v_1, v_2, ..., v_m,

                      k
     where  m = 2     Σ   s_i.
                     i=1

Profile V_s is developed by taking v_1, v_2, ... until s
variables are included; the value s may be predetermined or be
defined such that

          P(v_1 and v_2 and ... v_s) ≥ p*

     but  P(v_1 and v_2 and ... v_s and v_{s+1}) < p*,

where p* is a pre-specified level of p to ensure that the
resulting profile does not have a trivial proportion of cases.

In  practice,  we  may  find  that  the  joint  probability 
rapidly  approaches  zero.  Thus,  the  approach  could  involve  the 
following  modification,  assuming  a  minimum  p  =  p*  again,  which 
would  help  to  prevent  a  rapid  drop  in  the  probability  of  the 
profile.  Assume  that  the  profile  is  to  include  s  variables. 

Step 1:  Select the first variable v_i in the sequence v_1, v_2,
         ..., for which

              P(v_i) ≥ (p*)^(1/s).                                 (4)

         Enter this variable, relabeled as v^(1).

Step 2:  Select the first variable v_j in the sequence v_1, v_2,
         ... for which

              P(v^(1) and v_j) ≥ (p*)^(1/(s−1)).

         Relabel v_j as v^(2).

Step 3:  Continue as in step 2 until s variables have been
         selected, i.e., at stage h, select the first variable v_j
         in the sequence v_1, v_2, ... for which

              P(v^(1) and v^(2) and ... and v^(h−1) and v_j)
                                        ≥ (p*)^(1/(s−(h−1))).      (5)

         At stage s, then, we have

              P(v^(1) and v^(2) and ... and v^(s)) ≥ p*  as desired.

Note:  If, at any stage h, a variable cannot be found
that satisfies the inequality (5), the process stops and
the profile V will contain only h − 1 variables.
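A sketch of this modified selection follows. It rests on an assumption about criterion (5), namely that the stage h threshold is the (s − (h − 1))-th root of p*, so that the threshold falls gradually to p* at stage s; the column names, data, and function names are illustrative.

```python
def stage_thresholds(p_star, s):
    """Assumed thresholds for stages h = 1..s: p* ** (1 / (s - (h - 1)))."""
    return [p_star ** (1.0 / (s - (h - 1))) for h in range(1, s + 1)]

def build_profile(ranked, columns, p_star, s):
    """Take, at each stage, the first ranked variable that keeps the joint
    proportion of the 'and' profile at or above that stage's threshold."""
    n = len(next(iter(columns.values())))
    joint = [True] * n
    selected = []
    for threshold in stage_thresholds(p_star, s):
        for name in ranked:
            if name in selected:
                continue
            candidate = [a and b for a, b in zip(joint, columns[name])]
            if sum(candidate) / n >= threshold:
                selected.append(name)
                joint = candidate
                break
        else:
            break   # no variable meets the threshold: stop with fewer variables
    return selected, sum(joint) / n

T, F = True, False
columns = {
    "v1": [T, T, T, F, F, F, F, F],
    "v2": [T, T, T, T, T, F, F, F],
    "v3": [T, F, T, F, T, F, T, F],
}
selected, p_v = build_profile(["v1", "v2", "v3"], columns, p_star=0.25, s=2)
print(stage_thresholds(0.25, 2), selected, p_v)
```

Here the best-ranked variable v1 is passed over at stage 1 because it covers too few cases, and it enters at stage 2 once the threshold has dropped to p*.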

Now, assume the realized value of P(V) is p_v.  The next
step is to construct the other component, W, of the profile
F = V or W.  Assume that the overall target for the profile F is

          P(F) = p*_F.

Then, the target value for the profile W is

          p*_w = p*_F − p_v                                        (6)

since we will generally find that

          P(F) = P(V) + P(W)  with  P(V ∩ W) = 0.

W can be constructed in the same way as V using criterion (5)
on the new target value of p*_w.  However, if p*_w < p*,
and the same number of variables are to be selected, profile W
will be identical to V.  Or, W can be constructed by restricting
selection to variables not selected in construction of V.

Finally, W can be constructed by repeating the process on only
those observations for which V ≠ 1.  This last approach bears
some resemblance to the AID algorithm in that profile V is the
first "optimal" split of the data base.  W represents the optimal
split in the remaining portion of the data base.  This process
can then be repeated for further profiles Z etc... so that

          F = V or W or Z or ...,

where each component is made up of a series of variables
connected by and operators.


Note that this algorithm does not follow the procedures
of the General Sequential Algorithm.  That is, the variables are
selected in order of their unconditional effect on the objective
function, rather than in a stepwise fashion based on their effect
on the objective function given the variables already selected.
Thus, this algorithm loses some of the power to handle
interaction effects.  The sequential approach is discussed next.

Algorithm 5.7.1.2


Step 1:  Form the profile V exactly as with the General Sequential
         Algorithm, except that only the combinations "and"
         or "and not" are considered.  Again, the variable values
         x_ij can be replaced by indicator variables v such that

              v = 1 if X_i = x_ij
                = 0 if X_i ≠ x_ij.                                 (7)

         Similarly the expression not x_ij can be replaced by an
         indicator variable v̄ such that

              v̄ = 1 if X_i ≠ x_ij
                = 0 otherwise.                                     (8)

         Thus the profile V consists of variables v or v̄ linked
         by the and operator, i.e.,

              V_s = (v_1 and v_2 and v_3 and ... v_s)              (9)

         where each v_i can be expressed in form (7) or (8).

As with algorithm 1, s may be predetermined.  Or, we may
apply a constraint to the selection at each stage such as in (5)
in order to achieve a minimum specified value of p*.  Assume that
the realized value of P(V_s) is p_v and that the target overall
value for P(F) is p*_F.  Then, the target value for P(W) is, as
before,

              p*_w = p*_F − p_v.

Step 2:  As in algorithm 1, the profile W may be constructed in
         several ways:

         •  restricting W to variables not included in V;

         •  developing W in the subset of observations for
            which V ≠ 1;

         •  developing W independently of V, but applying
            the restriction in (5) to ensure the target
            value of p*_w is achieved.

Algorithm 5.7.1.3:  Optimal Subsets


The algorithms discussed so far have involved the
selection of variables one at a time.  An alternative procedure
is to consider optimal subsets of variables.  Consider again the
general profile form:

          F = V or W or Z

where each component of the profile is made up of variables
connected by the and operator, e.g.,

          V_s = (v_1 and v_2 and ... and v_s).

The aim now is to select the combination of variables
(v_1, v_2, ..., v_s) to maximize the objective function of interest.
In general, s would be relatively low, i.e., less than 4.  For
s = 2 or 3, the computational burden necessary to compute the
objective function for all possible subsets is not too great,
especially for small sample size.  Denote the feasible subsets,
ranked in order of the objective function, as S_1, S_2, ..., S_m.
Then, F could be formed in the following ways:

     •  V = S_1, W = S_2, Z = S_3, etc...                          (10)

     •  Or, F could be formed by computing all possible
        triples S_i or S_j or S_k and selecting that
        triple with the maximum value of the objective
        function, i.e., if

             θ_def =  max   {θ_ijk = θ(S_i or S_j or S_k)},        (11)
                     i,j,k

        set V = S_d, W = S_e, Z = S_f.

        In practice, the maximization could be restricted
        to the top x% of the S_i since it is unlikely that
        the optimal combination would be found among the
        subsets that perform relatively poorly with
        respect to the objective function.

     •  As usual, constraints on the resulting value of
        p, p*, say, may affect the approach.  For example,
        the maximization in (11) could be restricted to
        those triples for which

             p* − ε ≤ P(S_i or S_j or S_k) ≤ p* + ε,               (12)

        or the sequence in (10) could be continued until
        h subsets have been selected such that

             P(S_1 or S_2 or ... or S_{h−1}) < p*  but

             P(S_1 or S_2 or ... or S_h) ≥ p*.
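The exhaustive search over subsets and triples can be sketched with the standard library. Everything named here (the column dictionary, the `top` cut-off, and the use of 1 − P(misclassification) as the objective) is an assumption made for illustration.

```python
from itertools import combinations

def accuracy(rule, y):
    """theta = 1 - P(misclassification) for the rule 'assign to pi1 if True'."""
    return sum(r == t for r, t in zip(rule, y)) / len(y)

def best_triple(columns, y, s, top=None):
    # Score every subset of s indicators joined by 'and'.
    subsets = []
    for names in combinations(columns, s):
        rule = [all(vals) for vals in zip(*(columns[n] for n in names))]
        subsets.append((accuracy(rule, y), names, rule))
    subsets.sort(key=lambda item: item[0], reverse=True)   # S_1, S_2, ..., S_m
    if top is not None:
        subsets = subsets[:top]        # optionally keep only the best subsets
    # Score every triple S_i or S_j or S_k and keep the best one.
    best = None
    for trio in combinations(subsets, 3):
        rule = [any(vals) for vals in zip(*(item[2] for item in trio))]
        score = accuracy(rule, y)
        if best is None or score > best[0]:
            best = (score, [item[1] for item in trio])
    return best

T, F = True, False
y = [T, T, T, F]
columns = {"a": [T, T, F, T], "b": [T, F, T, T], "c": [F, T, T, F]}
score, triple = best_triple(columns, y, s=2)
print(score, triple)
```

With m feasible subsets there are m(m − 1)(m − 2)/6 triples, which is why restricting the search to the best-performing subsets (the `top` argument here) may be worthwhile.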

5.7.2  Form 2:  V and W

In this section, we consider output of the form

          F = V_s and W_t,  where                                  (13)

          V_s = (v_1 or v_2 or ... or v_s)

          W_t = (w_1 or w_2 or ... or w_t).

This form is like Form 1 but with the roles of "or" and "and"
reversed.  Thus we consider similar algorithms.  Because of this
similarity, details of these algorithms are not provided here.

Algorithm 5.7.2.1

Select variables for V in order of their performance
with respect to the objective function.  If a target p* is
desired, the procedure should be modified so that

          p* ≤ P(V) ≤ p* + ε.

At each stage h, we could restrict the selection to the first
variable v^(h) such that

          p*_h ≤ P(v^(1) or v^(2) or ... or v^(h)) ≤ p*_h + ε_h.    (14)

Constraint (14) is the analog of (5) for the or
combination.

W may then be constructed as in 5.7.1.1.

Algorithm 5.7.2.2

Form the profile V using the General Sequential Algorithm
restricted to the or and or not combinations.  Since the
variables v_i can be used to indicate x_kc or not x_kc, the
combinations are, in effect, restricted to or.  Then, W can be
constructed as outlined in Step 2 of Algorithm 5.7.1.1.

Algorithm 5.7.2.3

Develop V by finding the optimal (sample-based) subset
of s variables linked by or.  If these subsets are denoted S_1,
S_2, ... in order of performance on the objective function, choose

     •  V = S_1; W = S_2; Z = S_3

     •  Or, find the optimal combination of S_i and S_j and S_k,
        i.e., find S_d, S_e and S_f such that, for d ≠ e ≠ f,

             θ_def =  max   {θ_ijk = θ(S_i and S_j and S_k)}
                     i,j,k

     •  Develop optimal subsets with constraints on the
        objective function.


5.8  Evaluation 


The procedures described here have a great deal of intuitive
appeal.  However, the complexity of the logical expressions being
developed makes it difficult to develop closed-form results that
would establish statistical properties, for example, the
convergence of the sequence of values θ^(h) to the true optimal
value θ*.

As discussed in 5.2.2, the General Sequential Algorithm does
satisfy the five properties defined in Chapter II.  Nonetheless,
a test of the approach(es) would be appropriate.  The steps
involved in such a test are:

1)  Develop a general computer program to handle the
    General Sequential Algorithm and its variations.  This
    program would be set up so that the user could specify
    in advance the objective function desired, constraints
    on system parameters, the format of output, and the
    stopping rule.

2)  Obtain a data base of observations from a population
    π = <π1, π2> with each observation described by a set
    of categorical variables X_1, ..., X_k.

3)  Randomly divide the data base into two halves, holding
    one back.

4)  Apply the General Sequential Algorithm, and its
    variations, to the other half.  For each approach used,
    obtain the value of the objective function θ_B.

5)  For each profile developed in 4), calculate the value
    of the objective function on the hold-out sample, θ_A.
    Compare θ_A − θ_B (where possible, use estimates of
    the asymptotic variance of θ_A and θ_B to perform
    statistical tests of hypothesis, e.g., the procedure for doing
    this for the objective function ∇ was provided in
    Chapter II).

6)  Apply the LDF, Regression Analysis and AID techniques
    to the same data base.  Compare the ∇ results for these
    procedures on the test and hold-out sample as in 5).
    Also, compare the ∇ results for these procedures with
    the ∇ results for the General Sequential Algorithm and
    its variations.
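Steps 3) to 5) above can be sketched as a split-half check. The sketch is illustrative only: the case format, the seed, the toy rule, and the use of 1 − P(misclassification) as the objective are assumptions, and no variance estimate or significance test is attempted.

```python
import random

def split_half(cases, seed=0):
    """Randomly divide the data base into a development half and a hold-out half."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def theta(rule, cases):
    """Objective on a sample: share of cases whose assignment matches the label."""
    return sum(rule(x) == label for x, label in cases) / len(cases)

def shrinkage(rule, development, holdout):
    """Development value minus hold-out value; large positive values suggest overfitting."""
    return theta(rule, development) - theta(rule, holdout)

# Toy data: each case is (characteristics, True if the case belongs to pi1).
cases = [({"v1": i % 2, "v2": i % 3 == 0}, i % 2 == 1) for i in range(20)]
development, holdout = split_half(cases)
rule = lambda x: x["v1"] == 1          # a profile "developed" on the first half
print(theta(rule, development), theta(rule, holdout), shrinkage(rule, development, holdout))
```

In a real application the rule would be produced by one of the algorithms above using the development half only, and the hold-out half would be touched once, at the end.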

We have, however, some empirical evidence of the effectiveness
of one approach as applied to the Medicaid data base, as
shown in Table 5.2 below:

                       TABLE 5.2:  TEST RESULTS

 Type of    Profile                                      P(Mis-
 Case       Number       p       q       e       ∇    classification)

 AI          1.1        0.1     0.76    0.17    0.50       0.12
             1.2        0.2     0.61    0.17    0.58       0.12

 NH          2.1        0.1     0.92    0.30    0.36       0.22
             2.2        0.2     0.77    0.30    0.44       0.19

 AFDC        3.1        0.1     0.77    0.27    0.32       0.22
             3.2        0.2     0.58    0.27    0.36       0.24

These results were based on a version of the algorithm with
pre-specified output form, of the type 5.7.2, with constraints
on the value of p (0.1 or 0.2).  Again, the intent was not to
maximize ∇ but to maximize q for the fixed value of p.  Thus
the ∇ values tend to understate what could have been achieved.
Furthermore, since the solution was highly constrained, we can
hypothesize that results might have been superior if the
unconstrained General Sequential Algorithm had been used.
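The columns of Table 5.2 are related, and the relation can be illustrated numerically. The sketch assumes definitions that are not restated in this chapter: P(M) = p(1 − q) + (e − pq), and ∇ taken as the proportionate reduction in error relative to the error expected when the profile is independent of population membership. Under these assumptions the rounded entries for profiles 1.1 and 3.1 are reproduced; other rows may differ slightly because p, q and e are themselves rounded.

```python
def misclassification(p, q, e):
    """P(M): selected cases not in pi1, plus pi1 cases that were not selected."""
    return p * (1 - q) + (e - p * q)

def del_measure(p, q, e):
    """Assumed PRE form: 1 - P(M) / (error expected under independence)."""
    expected = p * (1 - e) + (1 - p) * e
    return 1 - misclassification(p, q, e) / expected

row_1_1 = (round(del_measure(0.1, 0.76, 0.17), 2), round(misclassification(0.1, 0.76, 0.17), 2))
row_3_1 = (round(del_measure(0.1, 0.77, 0.27), 2), round(misclassification(0.1, 0.77, 0.27), 2))
print(row_1_1, row_3_1)
```

Because p is held fixed, raising q lowers P(M) and raises ∇ together, which is consistent with the remark that maximizing q for fixed p is not the same as maximizing ∇ without the constraint.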

5.9  Summary 

In  this  Chapter,  we  have  introduced  a  new  class  of  screening 
techniques  for  handling  qualitative  variables.  Specifically, 
we  have  done  the  following: 

•  defined  a  General  Sequential  Algorithm  that 
can  be  used  with  a  wide  range  of  objective 
functions; 

     •  developed algorithms that can be used to
        achieve a pre-specified form of output.

In general, these algorithms were developed from an
intuitive and practical standpoint.  Hence, the choice among
algorithms is, at this point, largely a function of the user's
preference.  For example, if the form of the output is of
paramount concern, then one of the algorithms in 5.7.1 and 5.7.2
should be adopted.

The algorithms were also developed under the assumption that
no human intervention would be used in the construction of
profiles.  In practice, however, human intervention may be used
just as it is in, say, regression analysis where a large number
of regression runs may be used to guide the final model
specification and estimation.  The flexibility of the approaches
described here is such that different techniques can be applied
to the same problem, changes in the number of variables to be
included can be made, constraints can be relaxed, etc.  The
user may then choose the "best" among several solutions using
criteria above and beyond those inherent in the objective
function.

Thus,  we  have  attempted  to  widen  substantially  the  range  of 
techniques  available  to  those  interested  in  discriminating 
between  two  populations  based  on  the  qualitative  characteristics 
of  cases  in  each  population. 

However, the algorithms are all of the "search" type wherein
the data are analyzed in depth in order to identify effective
decision rules D.  As noted, the rules are developed ex post to
fit the data.  Because of the flexibility of the approaches, the
degrees of freedom are very high.  Thus, it is very important
that the methodologies only be applied in instances where it is
possible to test the results on a hold-back or new sample before
adopting the solution.  The techniques are very powerful from
the viewpoint of explaining the data, but this power may be
dangerous if used carelessly.

Furthermore, the algorithms can be expected to require
extensive computer time because of the number of calculations to
be performed at each stage.  The rapid advances in computer
technology have made such approaches possible.  It seems
appropriate that statistical technology keep pace with the power of
computers.  Hopefully, the work presented here is a step in this
direction.

REFERENCES

Aitchison, J. and Aitken, C.G.G. [1976].  "Multivariate Binary
     Discrimination by the Kernel Method."  Biometrika 63, No. 3,
     413-20.

Baker, Kenneth and Albaum, Gerald [1976].  "The Sampling Problem
     in Validation of Multiple Discriminant Analysis."
     Journal of Market Research Society 18, No. 3, 158-61.

Bendel, Robert B. and Afifi, A.A. [1977].  "Comparison of
     Stopping Rules in Forward 'Stepwise' Regression."  JASA 72,
     No. 357, 46-53.

Bishop, Y.M.M.; Fienberg, S.E. and Holland, P.W. [1975].
     Discrete Multivariate Analysis: Theory and Practice.
     Cambridge, Mass.: MIT Press.

Broffitt, James D.; Randles, Ronald H.; and Hogg, Robert V.
     [1976].  "Distribution Free Partial Discriminant Analysis."
     JASA 71, No. 356.

Chickner, Robert P. [1976].  "On Least Squares Estimation for
     Categorical Data."  Communications in Statistics - Theory
     and Methods, A5, No. 11, 1059-64.

Cochran, W.G. [1964].  "On the Performance of the Linear
     Discriminant Function."  Technometrics 6, 179-90.

Cochran, W.G. and Hopkins, C.E. [1961].  "Some Classification
     Problems with Multivariate Qualitative Data."  Biometrics
     17, 10-32.

Costner, H. [1965].  "Criteria for Measures of Association."
     American Sociological Review 30, 341-353.

David, Jean M. and McIver, Carolyn [1976].  "An Application of
     Multivariate Discriminant Analysis to Perceptions of
     Student Personnel Services."  American Statistical
     Association, Proceedings of the Social Statistics Section,
     Part I, 270-375.

Dawson, Beth [1976].  "A Review of the Applicability of
     Discriminant Analysis to Social Science Research."  American
     Statistical Association, Proceedings of the Social
     Statistics Section, Part I, 276-281.


MAXIMUS 


Dillon,  W.R.  and  Goldstein,  M.  C19783.  "On  the  Performance  of 
Some  Multinomial  Classification  Rules."  JASA  78,  No.  362. 

DiPillo,  P.J.  C19703.  "The  Application  of  Bias  to  Discriminant 
Analysis."  Communications  in  Statistics,  Theory  and 
Method,  A5.  No.  9,  843-54. 

Dunn,  O.J.  and  Varady,  P.D.  C1966D .  "Probabilities  of  Correct 
Classification  in  Discriminant  Analysis."  Biometrics  22, 
908-24. 

Gail,  Mitchell  M.  and  Green,  Sylvan  B.  [1976] .  "A  General¬ 
ization  of  the  One-Sided  Two  Sample  Kolmogorov-Smirnov 
Statistic  for  Evaluating  Diagnostic  Tests."  Biometrics  32 
No.  3,  561-70. 

Gilbert,  E.S.  C1969] .  "On  Discrimination  Using  Qualitative 
Variables."  JASA  63,  1399. 

Goldstein,  M.  and  Wolf,  C.  C19773 .  "On  the  Problem  of  Bias 

in  Multinomial  Classification."  Biometrics  33,  325-331. 

Goodman,  Leo  A.  and  Kruskal ,  William  M.  C19541 .  "Measures  of 
Association  for  Cross-Classifications."  JASA  49,  732-64. 

Grizzle,  J.E.;  Starmer,  C.F.  and  Koch,  G.G.  C19693 .  "Analysis 
of  Categorical  Data  by  Linear  Models."  Biometrics  25, 
489-504. 

Harushek,  Eric  A.  and  Jackson,  John  E.  C19771 .  Statistical 
Methods  for  Social  Scientists.  Academic  Press,  Inc., 

New  York. 

Hartigan,  John  A.  C19753  .  Clustering  Algorithms.  John  Wiley 
and  Sons,  Inc. 

Hildebrand, David K.; Laing, James D. and Rosenthal, Howard
[1977]. Prediction Analysis of Cross Classifications.
John Wiley and Sons, Inc.

Hills, M. [1966]. "Allocation Rules and Their Error Rates."
Journal of the Royal Statistical Society, B28, 1-31.

Lachenbruch, P.A. and Mickey, M.R. [1968]. "Estimation of
Error Rates in Discriminant Analysis." Technometrics 10,
No. 1.


Landis, Richard J.; Freeman, Jean L.; Stanish, William M.;
Koch, Gary G.; and Lewis, Alcinda L. [1976]. "GENCAT: A
Computer Program for the Generalized Least Squares Analysis
of Multivariate Categorical Data." American Statistical
Association, Proceedings of the Statistical Computing
Section, 190-195.

Lehmann, E.L. [1959]. Testing Statistical Hypotheses. New
York, John Wiley and Sons, Inc.

Martin, D.C. and Bradley, R.A. [1972]. "Probability Models,
Estimation and Classification for Multivariate Dichotomous
Populations." Biometrics 28, 203-222.

Matusita, K. [1954]. "On Estimation by the Minimum Distance
Method." Ann. Inst. Stat. Math. 7, 67-77.

----- [1955]. "Decision Rules Based on the Distance for
Problems of Fit, Two Samples and Estimation." Ann. Math.
Stat. 26, 631-40.

----- [1957]. "Classification Based on Distance in Multivariate
Gaussian Cases." Proc. Fifth Berkeley Symp.
Math. Stat. and Prob., 1, 299-304.

McDonald, Lyman L.; Lowe, Victor W.; Smidt, Robert K.; Meister,
Keven A. [1976]. "A Preliminary Test for Discriminant
Analysis Based on Small Samples." Biometrics 32, No. 2,
417-22.

McLachlan, G.J. [1974]. "Estimation of the Errors of
Misclassification on the Criterion of Asymptotic Mean Square
Error." Technometrics 16, 256-60.

McLachlan, G.J. [1976]. "The Bias of the Apparent Error Rate in
Discriminant Analysis." Biometrika 63, 239-244.

McLachlan, G.J. [1976]. "A Criterion for Selecting Variables for
the Linear Discriminant Function." Biometrics 32, No. 3,
529-34.

Miller, R.G. [1974]. "The Jackknife - A Review." Biometrika
61, 1-15.

Moore, D.H. [1973]. "Evaluation of Five Discriminant Procedures
for Binary Variables." JASA 68, 399-404.

Mosteller, F. [1968]. "Association and Estimation in Contingency
Tables." JASA 63, 1-28.


New Hampshire Division of Welfare [1977]. First Year Report on
the Title XIX Quality Control Project, Project #11-P-90147.
Office of Research and Demonstrations, DHEW.

----- [1978]. Second Year Report on the Title XIX Quality
Control Project.

----- [1979]. Third Year Report on the Title XIX Quality
Control Project.

Oberhue, H. and Ono, M. [1976]. "A Statistical Micro-Data
Source on AFDC Recipients." American Statistical Association,
Proceedings of the Social Statistics Section,
Part II, 645-50.


Powers, John A.; March, Lawrence C.; Huckfeldt, Robert R.;
Johnson, C.L. [1978]. "A Comparison of Logit, Probit and
Discriminant Analysis in Predicting Family Size." American
Statistical Association, Proceedings of the Social
Statistics Section, 693-698.

Quesenberry and Gessaman [1968]. "Nonparametric Discrimination
Using Tolerance Regions." Annals of Math. Stat., April,
664-73.

Reinmuth, James E. and Hawkins, Del I. [1977]. "Qualitative
Variable Discriminant Analysis and Its Use in Product
Version Selection." Decision Sciences 8, No. 2, 478-88.

Rolph, John E.; Williams, Albert P. and Lee, Carolyn L. [1978].
"The Effect of State of Residence on Medical School
Admissions: Empirical Bayes and Least Squares Discriminant
Estimators." American Statistical Association, Proceedings
of the Social Statistics Section, Part I, 89-98.

Scott, A.J. and Knott, M. [1976]. "An Approximate Test for Use
with AID." Applied Statistics 25, No. 2, 103-106.

Social Security Administration, Office of Family Assistance
[1979]. Conference on the Utilization of Characteristic
Profiles as a Workload Planning Technique. Conference
Notebook.

Sonquist, John A.; Baker, Elizabeth L. and Morgan, James N.
[1973]. Searching for Structure. University of Michigan,
Ann Arbor, Michigan.

Sorum, M. [1971]. "Expected and Optimal Probabilities of
Misclassification." IEEE Trans. on Information Theory,
IT-20, 472-479.

Welch, B.L. [1939]. "Note on Discriminant Functions."
Biometrika 31, 218-20.


